distances

_bag

abydos.distance._bag.

Bag similarity & distance

class distances._bag.Bag(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)

Bases: _TokenDistance

Bag distance.

Bag distance is proposed in :cite:`Bartolini:2002`. It is defined as

\[dist_{bag}(src, tar) = max(|multiset(src)-multiset(tar)|, |multiset(tar)-multiset(src)|)\]

Added in version 0.3.6.
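The definition above can be sketched with multisets from the standard library. This is a minimal illustration, assuming single characters as tokens; the names bag_distance and normalized_bag_distance are hypothetical and are not part of the library.

```python
from collections import Counter


def bag_distance(src: str, tar: str) -> int:
    """Return the larger of the two multiset differences."""
    src_bag, tar_bag = Counter(src), Counter(tar)
    # Counter subtraction keeps only positive counts: a multiset difference.
    only_src = sum((src_bag - tar_bag).values())
    only_tar = sum((tar_bag - src_bag).values())
    return max(only_src, only_tar)


def normalized_bag_distance(src: str, tar: str) -> float:
    """Divide the bag distance by the length of the longer string."""
    longest = max(len(src), len(tar))
    # Assumption: two empty strings are treated as identical.
    return bag_distance(src, tar) / longest if longest else 0.0


print(bag_distance('aluminum', 'Catalan'))             # 5
print(normalized_bag_distance('aluminum', 'Catalan'))  # 0.625
```

Because only character counts matter, anagrams such as 'ATCG' and 'TAGC' are at distance 0.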

Methods

dist(src, tar)

Return the normalized bag distance between two strings.

dist_abs(src, tar[, normalized])

Return the bag distance between two strings.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return the token similarity of two strings.

__init__(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) None

Initialize Bag instance.

Parameters:
tokenizer : _Tokenizer

A tokenizer instance from the abydos.tokenizer package

intersection_type : str

Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.

**kwargs

Arbitrary keyword arguments

Other Parameters:
qval : int

The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.

metric : _Distance

A string distance measure class for use in the soft and fuzzy variants.

threshold : float

A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.

Added in version 0.4.0.

dist(src: str, tar: str) float

Return the normalized bag distance between two strings.

Bag distance is normalized by dividing by \(max( |src|, |tar| )\).

Parameters:
src : str

Source string for comparison

tar : str

Target string for comparison

Returns:
float

Normalized bag distance

Examples

>>> cmp = Bag()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.4
>>> cmp.dist('aluminum', 'Catalan')
0.625
>>> cmp.dist('ATCG', 'TAGC')
0.0

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

dist_abs(src: str, tar: str, normalized: bool = False) float

Return the bag distance between two strings.

Parameters:
src : str

Source string for comparison

tar : str

Target string for comparison

normalized : bool

Normalizes to [0, 1] if True

Returns:
int or float

Bag distance

Examples

>>> cmp = Bag()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
2
>>> cmp.dist_abs('aluminum', 'Catalan')
5
>>> cmp.dist_abs('ATCG', 'TAGC')
0
>>> cmp.dist_abs('abcdefg', 'hijklm')
7
>>> cmp.dist_abs('abcdefg', 'hijklmno')
8

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

_baulieu_xiii

abydos.distance._baulieu_xiii.

Baulieu XIII distance

class distances._baulieu_xiii.BaulieuXIII(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)

Bases: _TokenDistance

Baulieu XIII distance.

For two sets X and Y and a population N, Baulieu XIII distance :cite:`Baulieu:1997` is

\[dist_{BaulieuXIII}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X|} {|X \cap Y| + |X \setminus Y| + |Y \setminus X| + |X \cap Y| \cdot (|X \cap Y| - 4)^2}\]

This is Baulieu’s 31st dissimilarity coefficient. This coefficient fails Baulieu’s (P4) property, that \(D(a+1,b,c,d) \leq D(a,b,c,d)\) with equality holding iff \(D(a,b,c,d) = 0\).

In 2x2 confusion table terms, where a+b+c+d=n, this is

\[dist_{BaulieuXIII} = \frac{b+c}{a+b+c+a \cdot (a-4)^2}\]

Added in version 0.4.0.
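The confusion-table form can be evaluated directly from the three counts it uses. The sketch below is illustrative only: the function name baulieu_xiii is hypothetical, and returning 0.0 for an empty comparison is an assumption rather than documented behavior.

```python
def baulieu_xiii(a: int, b: int, c: int) -> float:
    """Evaluate (b + c) / (a + b + c + a * (a - 4)**2).

    a counts shared tokens, b tokens only in the source, and c tokens only
    in the target; d does not appear in the formula.
    """
    denominator = a + b + c + a * (a - 4) ** 2
    # Assumption: an empty comparison (a = b = c = 0) counts as distance 0.
    if denominator == 0:
        return 0.0
    return (b + c) / denominator


print(baulieu_xiii(2, 2, 2))                       # 4 / 14, about 0.2857
print(baulieu_xiii(3, 2, 2), baulieu_xiii(4, 2, 2))  # 0.4 0.5
```

The second line illustrates the (P4) failure: raising a from 3 to 4 increases the distance, because the \((a-4)^2\) term vanishes at a = 4.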

Methods

dist(src, tar)

Return the Baulieu XIII distance of two strings.

dist_abs(src, tar)

Return absolute distance.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return the token similarity of two strings.

__init__(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) None

Initialize BaulieuXIII instance.

Parameters:
alphabet : Counter, collection, int, or None

This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.

tokenizer : _Tokenizer

A tokenizer instance from the abydos.tokenizer package

intersection_type : str

Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.

**kwargs

Arbitrary keyword arguments

Other Parameters:
qval : int

The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.

metric : _Distance

A string distance measure class for use in the soft and fuzzy variants.

threshold : float

A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.

Added in version 0.4.0.

dist(src: str, tar: str) float

Return the Baulieu XIII distance of two strings.

Parameters:
src : str

Source string (or QGrams/Counter objects) for comparison

tar : str

Target string (or QGrams/Counter objects) for comparison

Returns:
float

Baulieu XIII distance

Examples

>>> cmp = BaulieuXIII()
>>> cmp.dist('cat', 'hat')
0.2857142857142857
>>> cmp.dist('Niall', 'Neil')
0.4117647058823529
>>> cmp.dist('aluminum', 'Catalan')
0.6
>>> cmp.dist('ATCG', 'TAGC')
1.0

Added in version 0.4.0.

_character

abydos.tokenizer._character.

Character tokenizer

class distances._character.CharacterTokenizer(scaler: str | Callable[[float], float] | None = None)

Bases: _Tokenizer

A character tokenizer.

Added in version 0.4.0.

Methods

count()

Return token count.

count_unique()

Return the number of unique elements.

get_counter()

Return the tokens as a Counter object.

get_list()

Return the tokens as an ordered list.

get_set()

Return the unique tokens as a set.

tokenize(string)

Tokenize the term and store it.

__init__(scaler: str | Callable[[float], float] | None = None) None

Initialize tokenizer.

Parameters:
scaler : None, str, or function

A scaling function for the Counter:

  • None : no scaling

  • ‘set’ : All non-zero values are set to 1.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

Added in version 0.4.0.

tokenize(string: str) CharacterTokenizer

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters:
string : str

The string to tokenize

Examples

>>> CharacterTokenizer().tokenize('AACTAGAAC')
CharacterTokenizer({'A': 5, 'C': 2, 'T': 1, 'G': 1})

Added in version 0.4.0.
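The effect of the scaler options can be sketched with a plain Counter. This is a rough stand-in rather than the tokenizer's own code; the function name character_counts and the Scaler alias are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Optional, Union

Scaler = Optional[Union[str, Callable[[float], float]]]


def character_counts(string: str, scaler: Scaler = None) -> Counter:
    """Count each character, then optionally rescale the counts."""
    counts = Counter(string)
    if scaler is None:
        return counts
    if scaler == 'set':
        # Every character that occurs at all gets the value 1.
        return Counter({char: 1 for char in counts})
    # Otherwise treat the scaler as a callable applied to each count.
    return Counter({char: scaler(value) for char, value in counts.items()})


print(character_counts('AACTAGAAC'))
print(character_counts('AACTAGAAC', scaler='set'))
print(character_counts('AACTAGAAC', scaler=math.sqrt)['A'])
```

The unscaled call yields the same counts as the example above (A: 5, C: 2, T: 1, G: 1).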

_clement

abydos.distance._clement.

Clement similarity

class distances._clement.Clement(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)

Bases: _TokenDistance

Clement similarity.

For two sets X and Y and a population N, Clement similarity :cite:`Clement:1976` is defined as

\[sim_{Clement}(X, Y) = \frac{|X \cap Y|}{|X|}\Big(1-\frac{|X|}{|N|}\Big) + \frac{|(N \setminus X) \setminus Y|}{|N \setminus X|} \Big(1-\frac{|N \setminus X|}{|N|}\Big)\]

In 2x2 confusion table terms, where a+b+c+d=n, this is

\[sim_{Clement} = \frac{a}{a+b}\Big(1 - \frac{a+b}{n}\Big) + \frac{d}{c+d}\Big(1 - \frac{c+d}{n}\Big)\]

Added in version 0.4.0.
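The confusion-table form translates into a few lines of arithmetic. The sketch below is illustrative: the name clement_sim is hypothetical, the handling of zero denominators is an assumption, and the population size n = 784 in the usage line is chosen only for illustration.

```python
def clement_sim(a: int, b: int, c: int, d: int) -> float:
    """Evaluate a/(a+b) * (1 - (a+b)/n) + d/(c+d) * (1 - (c+d)/n)."""
    n = a + b + c + d
    if n == 0:
        return 0.0  # Assumption: an empty table has similarity 0.
    score = 0.0
    # Assumption: a term whose denominator is zero contributes nothing.
    if a + b:
        score += a / (a + b) * (1 - (a + b) / n)
    if c + d:
        score += d / (c + d) * (1 - (c + d) / n)
    return score


# a = 2 shared tokens, b = c = 2 unshared, d = 778 so that n = 784.
print(clement_sim(2, 2, 2, 778))  # about 0.5025
```

Each term weights a proportion by the complement of its group's share of the population, so a large d contributes little when c + d is close to n.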

Methods

dist(src, tar)

Return distance.

dist_abs(src, tar)

Return absolute distance.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return the Clement similarity of two strings.

__init__(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) None

Initialize Clement instance.

Parameters:
alphabet : Counter, collection, int, or None

This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.

tokenizer : _Tokenizer

A tokenizer instance from the abydos.tokenizer package

intersection_type : str

Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.

**kwargs

Arbitrary keyword arguments

Other Parameters:
qval : int

The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.

metric : _Distance

A string distance measure class for use in the soft and fuzzy variants.

threshold : float

A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.

Added in version 0.4.0.

sim(src: str, tar: str) float

Return the Clement similarity of two strings.

Parameters:
src : str

Source string (or QGrams/Counter objects) for comparison

tar : str

Target string (or QGrams/Counter objects) for comparison

Returns:
float

Clement similarity

Examples

>>> cmp = Clement()
>>> cmp.sim('cat', 'hat')
0.5025379382522239
>>> cmp.sim('Niall', 'Neil')
0.33840586363079933
>>> cmp.sim('aluminum', 'Catalan')
0.12119877280918714
>>> cmp.sim('ATCG', 'TAGC')
0.006336616803332366

Added in version 0.4.0.

_cormode_lz

abydos.distance._cormode_lz.

Cormode’s LZ distance

class distances._cormode_lz.CormodeLZ(**kwargs: Any)

Bases: _Distance

Cormode’s LZ distance.

Cormode’s LZ distance :cite:`Cormode:2000,Cormode:2003`

Added in version 0.4.0.

Methods

dist(src, tar)

Return the normalized Cormode's LZ distance of two strings.

dist_abs(src, tar)

Return the Cormode's LZ distance of two strings.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return similarity.

__init__(**kwargs: Any) None

Initialize CormodeLZ instance.

Parameters:
**kwargs

Arbitrary keyword arguments

Added in version 0.4.0.

dist(src: str, tar: str) float

Return the normalized Cormode’s LZ distance of two strings.

Parameters:
src : str

Source string for comparison

tar : str

Target string for comparison

Returns:
float

Cormode’s LZ distance

Examples

>>> cmp = CormodeLZ()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.8
>>> cmp.dist('aluminum', 'Catalan')
0.625
>>> cmp.dist('ATCG', 'TAGC')
0.75

Added in version 0.4.0.

dist_abs(src: str, tar: str) float

Return the Cormode’s LZ distance of two strings.

Parameters:
src : str

Source string for comparison

tar : str

Target string for comparison

Returns:
float

Cormode’s LZ distance

Examples

>>> cmp = CormodeLZ()
>>> cmp.dist_abs('cat', 'hat')
2
>>> cmp.dist_abs('Niall', 'Neil')
5
>>> cmp.dist_abs('aluminum', 'Catalan')
6
>>> cmp.dist_abs('ATCG', 'TAGC')
4

Added in version 0.4.0.

_damerau_levenshtein

abydos.distance._damerau_levenshtein.

Damerau-Levenshtein distance

class distances._damerau_levenshtein.DamerauLevenshtein(cost: Tuple[float, float, float, float] = (1, 1, 1, 1), normalizer: Callable[[List[float]], float] = max, **kwargs: Any)

Bases: _Distance

Damerau-Levenshtein distance.

This computes the Damerau-Levenshtein distance :cite:`Damerau:1964`. Damerau-Levenshtein code is based on Java code by Kevin L. Stern :cite:`Stern:2014`, under the MIT license: https://github.com/KevinStern/software-and-algorithms/blob/master/src/main/java/blogspot/software_and_algorithms/stern_library/string/DamerauLevenshteinAlgorithm.java

Methods

dist(src, tar)

Return the normalized Damerau-Levenshtein distance between two strings.

dist_abs(src, tar)

Return the Damerau-Levenshtein distance between two strings.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return similarity.

__init__(cost: Tuple[float, float, float, float] = (1, 1, 1, 1), normalizer: Callable[[List[float]], float] = max, **kwargs: Any)

Initialize DamerauLevenshtein instance.

Parameters:
cost : tuple

A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))

normalizer : function

A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.

**kwargs

Arbitrary keyword arguments

Added in version 0.4.0.

dist(src: str, tar: str) float

Return the normalized Damerau-Levenshtein distance between two strings.

Damerau-Levenshtein distance normalized to the interval [0, 1].

The Damerau-Levenshtein distance is normalized by dividing the Damerau-Levenshtein distance by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.

Parameters:
src : str

Source string for comparison

tar : str

Target string for comparison

Returns:
float

The normalized Damerau-Levenshtein distance

Examples

>>> cmp = DamerauLevenshtein()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.6
>>> cmp.dist('aluminum', 'Catalan')
0.875
>>> cmp.dist('ATCG', 'TAGC')
0.5

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

dist_abs(src: str, tar: str) float

Return the Damerau-Levenshtein distance between two strings.

Parameters:
src : str

Source string for comparison

tar : str

Target string for comparison

Returns:
int (may return a float if cost has float values)

The Damerau-Levenshtein distance between src & tar

Raises:
ValueError

Unsupported cost assignment; the cost of two transpositions must not be less than the cost of an insert plus a delete.

Examples

>>> cmp = DamerauLevenshtein()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
7
>>> cmp.dist_abs('ATCG', 'TAGC')
2

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class
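For orientation, the unrestricted Damerau-Levenshtein recurrence with unit costs can be sketched as follows. This is a textbook-style illustration, not the library's implementation; the function names are hypothetical, and the normalization divides by the longer string's length, which matches the description of dist only when all costs are 1.

```python
def damerau_levenshtein(src: str, tar: str) -> int:
    """Unit-cost distance allowing inserts, deletes, substitutions, transpositions."""
    last_row = {}  # last row (1-based) in which each source character was seen
    max_dist = len(src) + len(tar)
    # The table carries one extra border row and column holding max_dist.
    d = [[0] * (len(tar) + 2) for _ in range(len(src) + 2)]
    d[0][0] = max_dist
    for i in range(len(src) + 1):
        d[i + 1][0] = max_dist
        d[i + 1][1] = i
    for j in range(len(tar) + 1):
        d[0][j + 1] = max_dist
        d[1][j + 1] = j
    for i in range(1, len(src) + 1):
        last_col = 0  # last column in this row where the characters matched
        for j in range(1, len(tar) + 1):
            k = last_row.get(tar[j - 1], 0)
            col = last_col
            if src[i - 1] == tar[j - 1]:
                cost = 0
                last_col = j
            else:
                cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                          # match or substitution
                d[i + 1][j] + 1,                         # insertion
                d[i][j + 1] + 1,                         # deletion
                d[k][col] + (i - k - 1) + 1 + (j - col - 1),  # transposition
            )
        last_row[src[i - 1]] = i
    return d[len(src) + 1][len(tar) + 1]


def normalized_damerau_levenshtein(src: str, tar: str) -> float:
    longest = max(len(src), len(tar))
    # Assumption: two empty strings are treated as identical.
    return damerau_levenshtein(src, tar) / longest if longest else 0.0


print(damerau_levenshtein('ATCG', 'TAGC'))             # 2
print(normalized_damerau_levenshtein('ATCG', 'TAGC'))  # 0.5
```

'ATCG' becomes 'TAGC' through two adjacent transpositions (AT to TA, CG to GC), which is why the distance is 2 rather than the 4 positions at which the strings differ.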

_dice_asymmetric_i

abydos.distance._dice_asymmetric_i.

Dice’s Asymmetric I similarity

class distances._dice_asymmetric_i.DiceAsymmetricI(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)

Bases: _TokenDistance

Dice’s Asymmetric I similarity.

For two sets X and Y and a population N, Dice’s Asymmetric I similarity :cite:`Dice:1945` is

\[sim_{DiceAsymmetricI}(X, Y) = \frac{|X \cap Y|}{|X|}\]

In 2x2 confusion table terms, where a+b+c+d=n, this is

\[sim_{DiceAsymmetricI} = \frac{a}{a+b}\]
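As a rough sketch of the formula, the fraction can be computed from two token multisets. The tokenization here, bigrams padded with '$' and '#', is an assumption made for illustration, and the function names are hypothetical.

```python
from collections import Counter


def padded_bigrams(string: str, start: str = '$', stop: str = '#') -> Counter:
    """Return the multiset of bigrams of the padded string."""
    padded = start + string + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def dice_asymmetric_i(src: str, tar: str) -> float:
    """Share of the source tokens that also occur in the target."""
    x, y = padded_bigrams(src), padded_bigrams(tar)
    shared = sum((x & y).values())  # a: multiset intersection size
    size_x = sum(x.values())        # a + b: all source tokens
    return shared / size_x


print(dice_asymmetric_i('cat', 'hat'))  # 0.5
print(dice_asymmetric_i('hat', 'cat'))  # 0.5
print(dice_asymmetric_i('aluminum', 'Catalan'), dice_asymmetric_i('Catalan', 'aluminum'))
```

The last line shows the asymmetry: the one shared bigram is divided by 9 source tokens in one direction and by 8 in the other.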

Methods

dist(src, tar)

Return distance.

dist_abs(src, tar)

Return absolute distance.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return the Dice's Asymmetric I similarity of two strings.

Notes

In terms of a confusion matrix, this is equivalent to precision or positive predictive value ConfusionTable.precision().

Added in version 0.4.0.

__init__(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) None

Initialize DiceAsymmetricI instance.

Parameters:
tokenizer : _Tokenizer

A tokenizer instance from the abydos.tokenizer package

intersection_type : str

Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.

**kwargs

Arbitrary keyword arguments

Other Parameters:
qval : int

The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.

metric : _Distance

A string distance measure class for use in the soft and fuzzy variants.

threshold : float

A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.

Added in version 0.4.0.

sim(src: str, tar: str) float

Return the Dice’s Asymmetric I similarity of two strings.

Parameters:
src : str

Source string (or QGrams/Counter objects) for comparison

tar : str

Target string (or QGrams/Counter objects) for comparison

Returns:
float

Dice’s Asymmetric I similarity

Examples

>>> cmp = DiceAsymmetricI()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.3333333333333333
>>> cmp.sim('aluminum', 'Catalan')
0.1111111111111111
>>> cmp.sim('ATCG', 'TAGC')
0.0

Added in version 0.4.0.

_discounted_levenshtein

abydos.distance._discounted_levenshtein.

Discounted Levenshtein edit distance

class distances._discounted_levenshtein.DiscountedLevenshtein(mode: str = 'lev', normalizer: Callable[[List[float]], float] = max, discount_from: int | str = 1, discount_func: str | Callable[[float], float] = 'log', vowels: str = 'aeiou', **kwargs: Any)

Bases: Levenshtein

Discounted Levenshtein distance.

This is a variant of Levenshtein distance for which edits later in a string have discounted cost, on the theory that earlier edits are less likely than later ones.

Added in version 0.4.1.

Methods

alignment(src, tar)

Return the Levenshtein alignment of two strings.

dist(src, tar)

Return the normalized Levenshtein distance between two strings.

dist_abs(src, tar)

Return the Levenshtein distance between two strings.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return similarity.

__init__(mode: str = 'lev', normalizer: Callable[[List[float]], float] = max, discount_from: int | str = 1, discount_func: str | Callable[[float], float] = 'log', vowels: str = 'aeiou', **kwargs: Any) None

Initialize DiscountedLevenshtein instance.

Parameters:
mode : str

Specifies a mode for computing the discounted Levenshtein distance:

  • lev (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions

  • osa computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once

normalizer : function

A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.

discount_from : int or str

If an int is supplied, this is the first character whose edit cost will be discounted. If the str coda is supplied, discounting will start with the first non-vowel after the first vowel (the first syllable coda).

discount_func : str or function

The two supported str arguments are log, for a logarithmic discount function, and exp, for an exponential discount function. See notes below for information on how to supply your own discount function.

vowels : str

These are the letters to consider as vowels when discount_from is set to coda. It defaults to the English vowels ‘aeiou’, but it would be reasonable to localize this to other languages or to add orthographic semi-vowels like ‘y’, ‘w’, and even ‘h’.

**kwargs

Arbitrary keyword arguments

Notes

This class is highly experimental and will need additional tuning.

The discount function can be passed as a callable function. It should expect an integer as its only argument and return a float, ideally less than or equal to 1.0. The argument represents the degree of discounting to apply.

Added in version 0.4.1.
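A callable that satisfies the contract described in the notes might look like the sketch below. The function reciprocal_discount is a hypothetical example, not one of the built-in discount functions; per the signature above it would be supplied through the discount_func parameter.

```python
def reciprocal_discount(degree: int) -> float:
    """Hypothetical discount: 1.0 at degree 0, shrinking as the degree grows."""
    return 1.0 / (1.0 + degree)


# The returned factor never exceeds 1.0 and decreases with the degree.
print([round(reciprocal_discount(k), 3) for k in range(4)])  # [1.0, 0.5, 0.333, 0.25]
```

A function that falls off this quickly discounts later edits heavily; a gentler curve keeps later edits closer to full cost.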

dist(src: str, tar: str) float

Return the normalized Levenshtein distance between two strings.

The Levenshtein distance is normalized by dividing the Levenshtein distance (calculated by either of the two supported modes) by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the lengths of the two strings src & tar.

Parameters:
src : str

Source string for comparison

tar : str

Target string for comparison

Returns:
float

The normalized Levenshtein distance between src & tar

Examples

>>> cmp = DiscountedLevenshtein()
>>> cmp.dist('cat', 'hat')
0.3513958291799864
>>> cmp.dist('Niall', 'Neil')
0.5909885886270658
>>> cmp.dist('aluminum', 'Catalan')
0.8348163322045603
>>> cmp.dist('ATCG', 'TAGC')
0.7217609721523955

Added in version 0.4.1.

dist_abs(src: str, tar: str) float

Return the Levenshtein distance between two strings.

Parameters:
src : str

Source string for comparison

tar : str

Target string for comparison

Returns:
float (may return a float if cost has float values)

The Levenshtein distance between src & tar

Examples

>>> cmp = DiscountedLevenshtein()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
2.526064024369237
>>> cmp.dist_abs('aluminum', 'Catalan')
5.053867269967515
>>> cmp.dist_abs('ATCG', 'TAGC')
2.594032108779918
>>> cmp = DiscountedLevenshtein(mode='osa')
>>> cmp.dist_abs('ATCG', 'TAGC')
1.7482385137517997
>>> cmp.dist_abs('ACTG', 'TAGC')
3.342270622531718

Added in version 0.4.1.

_distance

abydos.distance._distance.

The distance._distance module implements abstract class _Distance.

_double_metaphone

abydos.phonetic._double_metaphone.

Double Metaphone

class distances._double_metaphone.DoubleMetaphone(max_length: int = -1)

Bases: _Phonetic

Double Metaphone.

Based on Lawrence Philips’ (Visual) C++ code from 1999 :cite:`Philips:2000`.

Added in version 0.3.6.

Methods

encode(word)

Return the Double Metaphone code for a word.

encode_alpha(word)

Return the alphabetic Double Metaphone code for a word.

__init__(max_length: int = -1) None

Initialize DoubleMetaphone instance.

Parameters:
max_length : int

Maximum length of the returned Double Metaphone code; this also activates the fixed-length code mode if it is greater than 0

Added in version 0.4.0.

encode(word: str) str

Return the Double Metaphone code for a word.

Parameters:
word : str

The word to transform

Returns:
str

The Double Metaphone value(s)

Examples

>>> pe = DoubleMetaphone()
>>> pe.encode('Christopher')
'KRSTFR,'
>>> pe.encode('Niall')
'NL,'
>>> pe.encode('Smith')
'SM0,XMT'
>>> pe.encode('Schmidt')
'XMT,SMT'

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

Changed in version 0.6.0: Made return a str only (comma-separated)
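Since the result is a single comma-separated string, callers that want the two codes separately can split it. The helper below is a hypothetical convenience, and reading the first field as the primary code and the second as the alternate is an assumption based on how Double Metaphone is usually described.

```python
from typing import Tuple


def split_codes(code: str) -> Tuple[str, str]:
    """Split 'PRIMARY,SECONDARY' into its two parts."""
    primary, _, secondary = code.partition(',')
    return primary, secondary


print(split_codes('SM0,XMT'))  # ('SM0', 'XMT')
print(split_codes('KRSTFR,'))  # ('KRSTFR', '')
```

An empty second element, as for 'KRSTFR,', means no alternate code was produced.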

encode_alpha(word: str) str

Return the alphabetic Double Metaphone code for a word.

Parameters:
word : str

The word to transform

Returns:
str

The alphabetic Double Metaphone value(s)

Examples

>>> pe = DoubleMetaphone()
>>> pe.encode_alpha('Christopher')
'KRSTFR,'
>>> pe.encode_alpha('Niall')
'NL,'
>>> pe.encode_alpha('Smith')
'SMÞ,XMT'
>>> pe.encode_alpha('Schmidt')
'XMT,SMT'

Added in version 0.4.0.

Changed in version 0.6.0: Made return a str only (comma-separated)

_editex

abydos.distance._editex.

Editex distance

class distances._editex.Editex(cost: Tuple[int, int, int] = (0, 1, 2), local: bool = False, taper: bool = False, **kwargs: Any)

Bases: _Distance

Editex.

As described on pages 3 & 4 of :cite:`Zobel:1996`.

The local variant is based on :cite:`Ring:2009`.

Added in version 0.3.6.

Changed in version 0.4.0: Added taper option

Methods

dist(src, tar)

Return the normalized Editex distance between two strings.

dist_abs(src, tar)

Return the Editex distance between two strings.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return similarity.

__init__(cost: Tuple[int, int, int] = (0, 1, 2), local: bool = False, taper: bool = False, **kwargs: Any) None

Initialize Editex instance.

Parameters:
cost : tuple

A 3-tuple representing the cost of the three possible edits: match, same-group, and mismatch, respectively (by default: (0, 1, 2))

local : bool

If True, the local variant of Editex is used

taper : bool

Enables cost tapering. Following :cite:`Zobel:1996`, it causes edits at the start of the string to “just [exceed] twice the minimum penalty for replacement or deletion at the end of the string”.

**kwargs

Arbitrary keyword arguments

Added in version 0.4.0.

dist(src: str, tar: str) float

Return the normalized Editex distance between two strings.

The Editex distance is normalized by dividing it by the greater of the lengths of src and tar multiplied by the mismatch cost. With the default cost of (0, 1, 2), this is twice the length of the longer string; for example, 'cat' and 'hat' have an Editex distance of 2, giving 2 / (3 × 2) ≈ 0.333.

Parameters:
src : str

Source string for comparison

tar : str

Target string for comparison

Returns:
float

Normalized Editex distance

Examples

>>> cmp = Editex()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.2
>>> cmp.dist('aluminum', 'Catalan')
0.75
>>> cmp.dist('ATCG', 'TAGC')
0.75

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

dist_abs(src: str, tar: str) float

Return the Editex distance between two strings.

Parameters:
src : str

Source string for comparison

tar : str

Target string for comparison

Returns:
int

Editex distance

Examples

>>> cmp = Editex()
>>> cmp.dist_abs('cat', 'hat')
2
>>> cmp.dist_abs('Niall', 'Neil')
2
>>> cmp.dist_abs('aluminum', 'Catalan')
12
>>> cmp.dist_abs('ATCG', 'TAGC')
6

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

_fuzzywuzzy_partial_string

abydos.distance._fuzzywuzzy_partial_string.

FuzzyWuzzy Partial String similarity

class distances._fuzzywuzzy_partial_string.FuzzyWuzzyPartialString(**kwargs: Any)

Bases: _Distance

FuzzyWuzzy Partial String similarity.

This follows the FuzzyWuzzy Partial String similarity algorithm :cite:`Cohen:2011`. Rather than returning an integer in the range [0, 100], as demonstrated in the blog post, this implementation returns a float in the range [0.0, 1.0].

Added in version 0.4.0.

Methods

dist(src, tar)

Return distance.

dist_abs(src, tar)

Return absolute distance.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return the FuzzyWuzzy Partial String similarity of two strings.

sim(src: str, tar: str) float

Return the FuzzyWuzzy Partial String similarity of two strings.

Parameters:
src : str

Source string for comparison

tar : str

Target string for comparison

Returns:
float

FuzzyWuzzy Partial String similarity

Examples

>>> cmp = FuzzyWuzzyPartialString()
>>> round(cmp.sim('cat', 'hat'), 12)
0.666666666667
>>> round(cmp.sim('Niall', 'Neil'), 12)
0.75
>>> round(cmp.sim('aluminum', 'Catalan'), 12)
0.428571428571
>>> cmp.sim('ATCG', 'TAGC')
0.5

Added in version 0.4.0.

_fuzzywuzzy_token_set

abydos.distance._fuzzywuzzy_token_set.

FuzzyWuzzy Token Set similarity

class distances._fuzzywuzzy_token_set.FuzzyWuzzyTokenSet(tokenizer: _Tokenizer | None = None, **kwargs: Any)

Bases: _TokenDistance

FuzzyWuzzy Token Set similarity.

This follows the FuzzyWuzzy Token Set similarity algorithm :cite:`Cohen:2011`. Rather than returning an integer in the range [0, 100], as demonstrated in the blog post, this implementation returns a float in the range [0.0, 1.0]. Distinct from the

Added in version 0.4.0.

Methods

dist(src, tar)

Return distance.

dist_abs(src, tar)

Return absolute distance.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return the FuzzyWuzzy Token Set similarity of two strings.

__init__(tokenizer: _Tokenizer | None = None, **kwargs: Any) None

Initialize FuzzyWuzzyTokenSet instance.

Parameters:
tokenizer : _Tokenizer

A tokenizer instance from the abydos.tokenizer package. By default, the regexp tokenizer is employed, matching only letters.

**kwargs

Arbitrary keyword arguments

Other Parameters:
qval : int

The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.

Added in version 0.4.0.

sim(src: str, tar: str) float

Return the FuzzyWuzzy Token Set similarity of two strings.

Parameters:
src : str

Source string (or QGrams/Counter objects) for comparison

tar : str

Target string (or QGrams/Counter objects) for comparison

Returns:
float

FuzzyWuzzy Token Set similarity

Examples

>>> cmp = FuzzyWuzzyTokenSet()
>>> cmp.sim('cat', 'hat')
0.75
>>> cmp.sim('Niall', 'Neil')
0.7272727272727273
>>> cmp.sim('aluminum', 'Catalan')
0.47058823529411764
>>> cmp.sim('ATCG', 'TAGC')
0.6

Added in version 0.4.0.

_fuzzywuzzy_token_sort

abydos.distance._fuzzywuzzy_token_sort.

FuzzyWuzzy Token Sort similarity

class distances._fuzzywuzzy_token_sort.FuzzyWuzzyTokenSort(tokenizer: _Tokenizer | None = None, **kwargs: Any)

Bases: _TokenDistance

FuzzyWuzzy Token Sort similarity.

This follows the FuzzyWuzzy Token Sort similarity algorithm :cite:`Cohen:2011`. Rather than returning an integer in the range [0, 100], as demonstrated in the blog post, this implementation returns a float in the range [0.0, 1.0].

Added in version 0.4.0.

Methods

dist(src, tar)

Return distance.

dist_abs(src, tar)

Return absolute distance.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return the FuzzyWuzzy Token Sort similarity of two strings.

__init__(tokenizer: _Tokenizer | None = None, **kwargs: Any) None

Initialize FuzzyWuzzyTokenSort instance.

Parameters:
tokenizer : _Tokenizer

A tokenizer instance from the abydos.tokenizer package. By default, the regexp tokenizer is employed, matching only letters.

**kwargs

Arbitrary keyword arguments

Other Parameters:
qval : int

The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.

Added in version 0.4.0.

sim(src: str, tar: str) float

Return the FuzzyWuzzy Token Sort similarity of two strings.

Parameters:
src : str

Source string (or QGrams/Counter objects) for comparison

tar : str

Target string (or QGrams/Counter objects) for comparison

Returns:
float

FuzzyWuzzy Token Sort similarity

Examples

>>> cmp = FuzzyWuzzyTokenSort()
>>> cmp.sim('cat', 'hat')
0.6666666666666666
>>> cmp.sim('Niall', 'Neil')
0.6666666666666666
>>> cmp.sim('aluminum', 'Catalan')
0.4
>>> cmp.sim('ATCG', 'TAGC')
0.5

Added in version 0.4.0.

_hamming

abydos.distance._hamming.

Hamming distance

class distances._hamming.Hamming(diff_lens: bool = True, **kwargs: Any)

Bases: _Distance

Hamming distance.

Hamming distance :cite:`Hamming:1950` equals the number of character positions at which two strings differ. For strings of unequal lengths, it is not normally defined. By default, this implementation calculates the Hamming distance of the first n characters where n is the lesser of the two strings’ lengths and adds to this the difference in string lengths.

Added in version 0.3.6.
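The behavior described above, including the handling of unequal lengths, can be sketched in a few lines. This is an illustrative stand-in with hypothetical function names, not the class's own code.

```python
def hamming(src: str, tar: str, diff_lens: bool = True) -> int:
    """Count differing positions, plus the length difference if allowed."""
    if not diff_lens and len(src) != len(tar):
        raise ValueError('Undefined for sequences of unequal length.')
    # zip stops at the shorter string, so only aligned positions are compared.
    mismatches = sum(s != t for s, t in zip(src, tar))
    # Each unmatched trailing character counts as one difference.
    return mismatches + abs(len(src) - len(tar))


def normalized_hamming(src: str, tar: str) -> float:
    longest = max(len(src), len(tar))
    # Assumption: two empty strings are treated as identical.
    return hamming(src, tar) / longest if longest else 0.0


print(hamming('Niall', 'Neil'))             # 3
print(normalized_hamming('Niall', 'Neil'))  # 0.6
```

For 'Niall' and 'Neil', two of the four aligned positions differ and the extra trailing character adds one more, giving 3.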

Methods

dist(src, tar)

Return the normalized Hamming distance between two strings.

dist_abs(src, tar)

Return the Hamming distance between two strings.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return similarity.

__init__(diff_lens: bool = True, **kwargs: Any) None

Initialize Hamming instance.

Parameters:
diff_lens : bool

If True (default), this returns the Hamming distance for those characters that have a matching character in both strings plus the difference in the strings’ lengths. This is equivalent to extending the shorter string with obligatorily non-matching characters. If False, an exception is raised in the case of strings of unequal lengths.

**kwargs

Arbitrary keyword arguments

Added in version 0.4.0.

dist(src: str, tar: str) float

Return the normalized Hamming distance between two strings.

Hamming distance normalized to the interval [0, 1].

The Hamming distance is normalized by dividing it by the greater of the number of characters in src & tar (unless diff_lens is set to False, in which case an exception is raised).

The arguments are identical to those of the hamming() function.

Parameters:
src : str

Source string for comparison

tar : str

Target string for comparison

Returns:
float

Normalized Hamming distance

Examples

>>> cmp = Hamming()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> cmp.dist('Niall', 'Neil')
0.6
>>> cmp.dist('aluminum', 'Catalan')
1.0
>>> cmp.dist('ATCG', 'TAGC')
1.0

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

dist_abs(src: str, tar: str) float

Return the Hamming distance between two strings.

Parameters:
src : str

Source string for comparison

tarstr

Target string for comparison

Returns:
int

The Hamming distance between src & tar

Raises:
ValueError

Undefined for sequences of unequal length; set diff_lens to True for Hamming distance between strings of unequal lengths.

Examples

>>> cmp = Hamming()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
8
>>> cmp.dist_abs('ATCG', 'TAGC')
4

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

_indel

abydos.distance._indel.

Indel distance

class distances._indel.Indel(**kwargs: Any)

Bases: Levenshtein

Indel distance.

This is equivalent to Levenshtein distance, when only inserts and deletes are possible.

Added in version 0.3.6.

Methods

alignment(src, tar)

Return the Levenshtein alignment of two strings.

dist(src, tar)

Return the normalized indel distance between two strings.

dist_abs(src, tar)

Return the Levenshtein distance between two strings.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return similarity.

__init__(**kwargs: Any) None

Initialize Indel instance.

Parameters:
**kwargs

Arbitrary keyword arguments

Added in version 0.4.0.

dist(src: str, tar: str) float

Return the normalized indel distance between two strings.

This is equivalent to normalized Levenshtein distance, when only inserts and deletes are possible.

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

Returns:
float

Normalized indel distance

Examples

>>> cmp = Indel()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.333333333333
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.454545454545
>>> cmp.dist('ATCG', 'TAGC')
0.5

Added in version 0.3.6.
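
With only inserts and deletes available, the distance can be derived from the longest common subsequence (LCS): every character outside the LCS must be deleted from one string or inserted from the other. The sketch below uses hypothetical helper names, and dividing by the sum of the two lengths is an assumption inferred from the example values above rather than something this page states.

```python
def lcs_length(src: str, tar: str) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(tar) + 1)
    for s in src:
        curr = [0]
        for j, t in enumerate(tar, start=1):
            if s == t:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel(src: str, tar: str) -> int:
    """Insert/delete-only edit distance: characters outside the LCS."""
    return len(src) + len(tar) - 2 * lcs_length(src, tar)


def indel_normalized(src: str, tar: str) -> float:
    """Assumed normalization: divide by the combined length."""
    total = len(src) + len(tar)
    return indel(src, tar) / total if total else 0.0
```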

_iterative_substring

abydos.distance._iterative_substring.

Iterative-SubString (I-Sub) correlation

class distances._iterative_substring.IterativeSubString(hamacher: float = 0.6, normalize_strings: bool = False, **kwargs: Any)

Bases: _Distance

Iterative-SubString correlation.

Iterative-SubString (I-Sub) correlation :cite:`Stoilos:2005`

This is a straightforward port of the primary author’s Java implementation: http://www.image.ece.ntua.gr/~gstoil/software/I_Sub.java

Added in version 0.4.0.

Methods

corr(src, tar)

Return the Iterative-SubString correlation of two strings.

dist(src, tar)

Return distance.

dist_abs(src, tar)

Return absolute distance.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return the Iterative-SubString similarity of two strings.

__init__(hamacher: float = 0.6, normalize_strings: bool = False, **kwargs: Any) None

Initialize IterativeSubString instance.

Parameters:
hamacherfloat

The constant factor for the Hamacher product

normalize_stringsbool

Normalize the strings by removing the characters '.', '_', and ' ' and lowercasing

**kwargs

Arbitrary keyword arguments

Added in version 0.4.0.

corr(src: str, tar: str) float

Return the Iterative-SubString correlation of two strings.

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

Returns:
float

Iterative-SubString correlation

Examples

>>> cmp = IterativeSubString()
>>> cmp.corr('cat', 'hat')
-1.0
>>> cmp.corr('Niall', 'Neil')
-0.9
>>> cmp.corr('aluminum', 'Catalan')
-1.0
>>> cmp.corr('ATCG', 'TAGC')
-1.0

Added in version 0.4.0.

sim(src: str, tar: str) float

Return the Iterative-SubString similarity of two strings.

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

Returns:
float

Iterative-SubString similarity

Examples

>>> cmp = IterativeSubString()
>>> cmp.sim('cat', 'hat')
0.0
>>> cmp.sim('Niall', 'Neil')
0.04999999999999999
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0

Added in version 0.4.0.

_kuhns_iii

abydos.distance._kuhns_iii.

Kuhns III correlation

class distances._kuhns_iii.KuhnsIII(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)

Bases: _TokenDistance

Kuhns III correlation.

For two sets X and Y and a population N, Kuhns III correlation :cite:`Kuhns:1965`, the excess of proportion of overlap over its independence value (P), is

\[corr_{KuhnsIII}(X, Y) = \frac{\delta(X, Y)}{\big(1-\frac{|X \cap Y|}{|X|+|Y|}\big) \big(|X|+|Y|-\frac{|X|\cdot|Y|}{|N|}\big)}\]

where

\[\delta(X, Y) = |X \cap Y| - \frac{|X| \cdot |Y|}{|N|}\]

In 2x2 confusion table terms, where a+b+c+d=n, this is

\[corr_{KuhnsIII} = \frac{\delta(a+b, a+c)}{\big(1-\frac{a}{2a+b+c}\big) \big(2a+b+c-\frac{(a+b)(a+c)}{n}\big)}\]

where

\[\delta(a+b, a+c) = a - \frac{(a+b)(a+c)}{n}\]
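
The confusion-table form translates directly into code. The following is a sketch only: the function name `kuhns_iii` is hypothetical, and returning 0.0 when a denominator vanishes is an assumption, not documented behaviour.

```python
def kuhns_iii(a: float, b: float, c: float, d: float) -> float:
    """Kuhns III correlation from 2x2 confusion-table counts."""
    n = a + b + c + d
    marginal_sum = 2 * a + b + c  # |X| + |Y|
    if n == 0 or marginal_sum == 0:
        return 0.0  # assumption: undefined cases map to 0.0
    expected = (a + b) * (a + c) / n  # overlap expected under independence
    delta = a - expected
    denom = (1 - a / marginal_sum) * (marginal_sum - expected)
    return delta / denom if denom else 0.0
```

As a worked instance with hypothetical counts a = b = c = 2 and n = 784, the expected overlap is 16/784, so delta is about 1.9796 and the denominator is 0.75 × 7.9796, giving a correlation of about 0.3308.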

Methods

corr(src, tar)

Return the Kuhns III correlation of two strings.

dist(src, tar)

Return distance.

dist_abs(src, tar)

Return absolute distance.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return the Kuhns III similarity of two strings.

Notes

The coefficient presented in :cite:`Eidenberger:2014,Morris:2012` as Kuhns’ “Proportion of overlap above independence” is a significantly different coefficient, not evidenced in :cite:`Kuhns:1965`.

Added in version 0.4.0.

__init__(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) None

Initialize KuhnsIII instance.

Parameters:
alphabetCounter, collection, int, or None

This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.

tokenizer_Tokenizer

A tokenizer instance from the abydos.tokenizer package

intersection_typestr

Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.

**kwargs

Arbitrary keyword arguments

Other Parameters:
qvalint

The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.

metric_Distance

A string distance measure class for use in the soft and fuzzy variants.

thresholdfloat

A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.

Added in version 0.4.0.

corr(src: str, tar: str) float

Return the Kuhns III correlation of two strings.

Parameters:
srcstr

Source string (or QGrams/Counter objects) for comparison

tarstr

Target string (or QGrams/Counter objects) for comparison

Returns:
float

Kuhns III correlation

Examples

>>> cmp = KuhnsIII()
>>> cmp.corr('cat', 'hat')
0.3307757885763001
>>> cmp.corr('Niall', 'Neil')
0.21873141468207793
>>> cmp.corr('aluminum', 'Catalan')
0.05707545392902886
>>> cmp.corr('ATCG', 'TAGC')
-0.003198976327575176

Added in version 0.4.0.

sim(src: str, tar: str) float

Return the Kuhns III similarity of two strings.

Parameters:
srcstr

Source string (or QGrams/Counter objects) for comparison

tarstr

Target string (or QGrams/Counter objects) for comparison

Returns:
float

Kuhns III similarity

Examples

>>> cmp = KuhnsIII()
>>> cmp.sim('cat', 'hat')
0.498081841432225
>>> cmp.sim('Niall', 'Neil')
0.41404856101155846
>>> cmp.sim('aluminum', 'Catalan')
0.29280659044677165
>>> cmp.sim('ATCG', 'TAGC')
0.24760076775431863

Added in version 0.4.0.

_lcprefix

abydos.distance._lcprefix.

Longest common prefix

class distances._lcprefix.LCPrefix(**kwargs: Any)

Bases: _Distance

Longest common prefix.

Added in version 0.4.0.

Methods

dist(src, tar)

Return distance.

dist_abs(src, tar, *args)

Return the length of the longest common prefix of the strings.

lcprefix(strings)

Return the longest common prefix of a list of strings.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar, *args)

Return the longest common prefix similarity of two or more strings.

dist_abs(src: str, tar: str, *args: str) int

Return the length of the longest common prefix of the strings.

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

*argsstrs

Additional strings for comparison

Returns:
int

The length of the longest common prefix

Raises:
ValueError

All arguments must be of type str

Examples

>>> pfx = LCPrefix()
>>> pfx.dist_abs('cat', 'hat')
0
>>> pfx.dist_abs('Niall', 'Neil')
1
>>> pfx.dist_abs('aluminum', 'Catalan')
0
>>> pfx.dist_abs('ATCG', 'TAGC')
0

Added in version 0.4.0.

lcprefix(strings: List[str]) str

Return the longest common prefix of a list of strings.

Longest common prefix (LCPrefix).

Parameters:
stringslist of strings

Strings for comparison

Returns:
str

The longest common prefix

Examples

>>> pfx = LCPrefix()
>>> pfx.lcprefix(['cat', 'hat'])
''
>>> pfx.lcprefix(['Niall', 'Neil'])
'N'
>>> pfx.lcprefix(['aluminum', 'Catalan'])
''
>>> pfx.lcprefix(['ATCG', 'TAGC'])
''

Added in version 0.4.0.

sim(src: str, tar: str, *args: str) float

Return the longest common prefix similarity of two or more strings.

Longest common prefix similarity (\(sim_{LCPrefix}\)).

This employs the LCPrefix function to derive a similarity metric: \(sim_{LCPrefix}(s,t) = \frac{|LCPrefix(s,t)|}{max(|s|, |t|)}\)

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

*argsstrs

Additional strings for comparison

Returns:
float

LCPrefix similarity

Examples

>>> pfx = LCPrefix()
>>> pfx.sim('cat', 'hat')
0.0
>>> pfx.sim('Niall', 'Neil')
0.2
>>> pfx.sim('aluminum', 'Catalan')
0.0
>>> pfx.sim('ATCG', 'TAGC')
0.0

Added in version 0.4.0.
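
The prefix and its similarity can be sketched without the library. The names `lcprefix` and `lcprefix_sim` below are hypothetical stand-ins, and returning 1.0 when every string is empty is an assumption.

```python
from typing import Sequence


def lcprefix(strings: Sequence[str]) -> str:
    """Longest prefix shared by every string in the sequence."""
    if not strings:
        return ''
    prefix_len = 0
    # zip stops at the shortest string, so the prefix cannot overrun it
    for chars in zip(*strings):
        if len(set(chars)) > 1:
            break
        prefix_len += 1
    return strings[0][:prefix_len]


def lcprefix_sim(src: str, tar: str, *args: str) -> float:
    """Prefix length divided by the length of the longest string."""
    strings = [src, tar, *args]
    longest = max(len(s) for s in strings)
    if longest == 0:
        return 1.0  # assumption: all-empty inputs count as identical
    return len(lcprefix(strings)) / longest
```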

_lcsseq

abydos.distance._lcsseq.

Longest common subsequence

class distances._lcsseq.LCSseq(normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, **kwargs: ~typing.Any)

Bases: _Distance

Longest common subsequence.

Longest common subsequence (LCSseq) is the longest subsequence of characters that two strings have in common.

Added in version 0.3.6.

Methods

dist(src, tar)

Return distance.

dist_abs(src, tar)

Return absolute distance.

lcsseq(src, tar)

Return the longest common subsequence of two strings.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return the longest common subsequence similarity of two strings.

__init__(normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, **kwargs: ~typing.Any) None

Initialize LCSseq.

Parameters:
normalizerfunction

A normalization function for the normalized similarity & distance. By default, the max of the lengths of the input strings. If lambda x: sum(x)/2.0 is supplied, the normalization proposed in :cite:`Radev:2001` is used, i.e. \(\frac{2 \cdot |LCS(src, tar)|}{|src| + |tar|}\).

**kwargs

Arbitrary keyword arguments

Added in version 0.4.0.

lcsseq(src: str, tar: str) str

Return the longest common subsequence of two strings.

Based on the dynamic programming algorithm from http://rosettacode.org/wiki/Longest_common_subsequence :cite:`rosettacode:2018b`. This is licensed GFDL 1.2.

Modifications include:

  • conversion to a numpy array in place of a list of lists

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

Returns:
str

The longest common subsequence

Examples

>>> sseq = LCSseq()
>>> sseq.lcsseq('cat', 'hat')
'at'
>>> sseq.lcsseq('Niall', 'Neil')
'Nil'
>>> sseq.lcsseq('aluminum', 'Catalan')
'aln'
>>> sseq.lcsseq('ATCG', 'TAGC')
'AC'

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

sim(src: str, tar: str) float

Return the longest common subsequence similarity of two strings.

Longest common subsequence similarity (\(sim_{LCSseq}\)).

This employs the LCSseq function to derive a similarity metric: \(sim_{LCSseq}(s,t) = \frac{|LCSseq(s,t)|}{max(|s|, |t|)}\)

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

Returns:
float

LCSseq similarity

Examples

>>> sseq = LCSseq()
>>> sseq.sim('cat', 'hat')
0.6666666666666666
>>> sseq.sim('Niall', 'Neil')
0.6
>>> sseq.sim('aluminum', 'Catalan')
0.375
>>> sseq.sim('ATCG', 'TAGC')
0.5

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

Changed in version 0.4.0: Added normalization option
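
A plain-Python sketch of the dynamic programming approach and the pluggable normalizer follows. It uses a list of lists rather than a numpy array, the names `lcsseq` and `lcsseq_sim` are hypothetical, and when several subsequences tie for longest, the one returned depends on this sketch's tie-breaking, which may differ from the library's.

```python
from typing import Callable, List


def lcsseq(src: str, tar: str) -> str:
    """Return one longest common subsequence of src and tar."""
    rows, cols = len(src), len(tar)
    lengths = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i, s in enumerate(src, start=1):
        for j, t in enumerate(tar, start=1):
            if s == t:
                lengths[i][j] = lengths[i - 1][j - 1] + 1
            else:
                lengths[i][j] = max(lengths[i - 1][j], lengths[i][j - 1])
    # Walk back from the bottom-right corner to recover the characters
    out = []
    i, j = rows, cols
    while i and j:
        if src[i - 1] == tar[j - 1]:
            out.append(src[i - 1])
            i -= 1
            j -= 1
        elif lengths[i - 1][j] >= lengths[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return ''.join(reversed(out))


def lcsseq_sim(
    src: str, tar: str, normalizer: Callable[[List[float]], float] = max
) -> float:
    """LCS length divided by normalizer([len(src), len(tar)])."""
    norm = normalizer([len(src), len(tar)])
    return len(lcsseq(src, tar)) / norm if norm else 1.0  # assumption for empty inputs
```

Passing `lambda x: sum(x) / 2.0` as the normalizer yields the averaged-length normalization mentioned for the constructor; for 'Niall' and 'Neil' that is 3 / 4.5 rather than 3 / 5.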

_levenshtein

abydos.distance._levenshtein.

The distance._levenshtein module implements string edit distance functions based on Levenshtein distance, including:

  • Levenshtein distance

  • Optimal String Alignment distance

class distances._levenshtein.Levenshtein(mode: str = 'lev', cost: ~typing.Tuple[float, float, float, float] = (1, 1, 1, 1), normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, taper: bool = False, **kwargs: ~typing.Any)

Bases: _Distance

Levenshtein distance.

This is the standard edit distance measure. Cf. :cite:`Levenshtein:1965,Levenshtein:1966`.

Optimal string alignment (aka restricted Damerau-Levenshtein distance) :cite:`Boytsov:2011` is also supported.

The ordinary Levenshtein & Optimal String Alignment distance both employ the Wagner-Fischer dynamic programming algorithm :cite:`Wagner:1974`.

Levenshtein edit distance ordinarily has unit insertion, deletion, and substitution costs.

Added in version 0.3.6.

Changed in version 0.4.0: Added taper option

Methods

alignment(src, tar)

Return the Levenshtein alignment of two strings.

dist(src, tar)

Return the normalized Levenshtein distance between two strings.

dist_abs(src, tar)

Return the Levenshtein distance between two strings.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return similarity.

__init__(mode: str = 'lev', cost: ~typing.Tuple[float, float, float, float] = (1, 1, 1, 1), normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, taper: bool = False, **kwargs: ~typing.Any) None

Initialize Levenshtein instance.

Parameters:
modestr

Specifies a mode for computing the Levenshtein distance:

  • lev (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions

  • osa computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once

costtuple

A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))

normalizerfunction

A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.

taperbool

Enables cost tapering. Following :cite:`Zobel:1996`, it causes edits at the start of the string to “just [exceed] twice the minimum penalty for replacement or deletion at the end of the string”.

**kwargs

Arbitrary keyword arguments

Added in version 0.4.0.

alignment(src: str, tar: str) Tuple[float, str, str]

Return the Levenshtein alignment of two strings.

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

Returns:
tuple

A tuple containing the Levenshtein distance and the two strings, aligned.

Examples

>>> cmp = Levenshtein()
>>> cmp.alignment('cat', 'hat')
(1.0, 'cat', 'hat')
>>> cmp.alignment('Niall', 'Neil')
(3.0, 'N-iall', 'Nei-l-')
>>> cmp.alignment('aluminum', 'Catalan')
(7.0, '-aluminum', 'Catalan--')
>>> cmp.alignment('ATCG', 'TAGC')
(3.0, 'ATCG-', '-TAGC')
>>> cmp = Levenshtein(mode='osa')
>>> cmp.alignment('ATCG', 'TAGC')
(2.0, 'ATCG', 'TAGC')
>>> cmp.alignment('ACTG', 'TAGC')
(4.0, 'ACT-G-', '--TAGC')

Added in version 0.4.1.

dist(src: str, tar: str) float

Return the normalized Levenshtein distance between two strings.

The Levenshtein distance is normalized by dividing the Levenshtein distance (calculated by either of the two supported methods) by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

Returns:
float

The normalized Levenshtein distance between src & tar

Examples

>>> cmp = Levenshtein()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.6
>>> cmp.dist('aluminum', 'Catalan')
0.875
>>> cmp.dist('ATCG', 'TAGC')
0.75

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

dist_abs(src: str, tar: str) float

Return the Levenshtein distance between two strings.

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

Returns:
int (may return a float if cost has float values)

The Levenshtein distance between src & tar

Examples

>>> cmp = Levenshtein()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
7
>>> cmp.dist_abs('ATCG', 'TAGC')
3
>>> cmp = Levenshtein(mode='osa')
>>> cmp.dist_abs('ATCG', 'TAGC')
2
>>> cmp.dist_abs('ACTG', 'TAGC')
4

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class
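
The two modes can be illustrated with a compact Wagner-Fischer table. This is a sketch under stated assumptions, not the library's code: the names `edit_distance` and `edit_distance_normalized` are hypothetical, tapering is omitted, and the cost tuple is assumed to be ordered (insert, delete, substitute, transpose) as described for the constructor.

```python
from typing import Tuple


def edit_distance(
    src: str,
    tar: str,
    mode: str = 'lev',
    cost: Tuple[float, float, float, float] = (1, 1, 1, 1),
) -> float:
    """Levenshtein ('lev') or Optimal String Alignment ('osa') distance."""
    ins_cost, del_cost, sub_cost, trans_cost = cost
    rows, cols = len(src) + 1, len(tar) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * del_cost
    for j in range(1, cols):
        d[0][j] = j * ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + del_cost,
                d[i][j - 1] + ins_cost,
                d[i - 1][j - 1] + (0 if src[i - 1] == tar[j - 1] else sub_cost),
            )
            # OSA additionally allows swapping two adjacent characters
            if (
                mode == 'osa'
                and i > 1
                and j > 1
                and src[i - 1] == tar[j - 2]
                and src[i - 2] == tar[j - 1]
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + trans_cost)
    return d[-1][-1]


def edit_distance_normalized(src: str, tar: str, mode: str = 'lev') -> float:
    """Unit-cost normalization: divide by the longer string's length."""
    longest = max(len(src), len(tar))
    return edit_distance(src, tar, mode) / longest if longest else 0.0
```

'ATCG' to 'TAGC' costs 3 in 'lev' mode but 2 in 'osa' mode, because AT/TA and CG/GC are each a single transposition.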

sim(src: str, tar: str)

Return similarity.

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

Returns:
float

Similarity

Added in version 0.3.6.

_lig3

abydos.distance._lig3.

LIG3 similarity

class distances._lig3.LIG3(**kwargs: Any)

Bases: _Distance

LIG3 similarity.

:cite:`Snae:2002` proposes three Levenshtein-ISG-Guth hybrid similarity measures: LIG1, LIG2, and LIG3. Of these, LIG1 is identical to ISG and LIG2 is identical to normalized Levenshtein similarity. Only LIG3 is a novel measure, defined as:

\[sim_{LIG3}(X, Y) = \frac{2I}{2I+C}\]

Here, I is the number of exact matches between the two words, truncated to the length of the shorter word, and C is the Levenshtein distance between the two words.

Added in version 0.4.1.
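
The formula above can be sketched as follows. The function names are hypothetical, the Levenshtein helper is a minimal unit-cost version, and returning 1.0 for identical inputs (which avoids 0/0 for two empty strings) is an assumption.

```python
def levenshtein(src: str, tar: str) -> int:
    """Unit-cost Levenshtein distance, computed row by row."""
    prev = list(range(len(tar) + 1))
    for i, s in enumerate(src, start=1):
        curr = [i]
        for j, t in enumerate(tar, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (s != t)))
        prev = curr
    return prev[-1]


def lig3(src: str, tar: str) -> float:
    """LIG3 similarity: 2I / (2I + C)."""
    if src == tar:
        return 1.0  # assumption: identical strings are maximally similar
    # I: position-wise exact matches, limited to the shorter string
    exact = sum(s == t for s, t in zip(src, tar))
    # C: Levenshtein distance (at least 1 here, so the denominator is nonzero)
    cost = levenshtein(src, tar)
    return 2 * exact / (2 * exact + cost)
```

For 'cat' and 'hat', I = 2 and C = 1, so the similarity is 4 / 5 = 0.8.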

Methods

dist(src, tar)

Return distance.

dist_abs(src, tar)

Return absolute distance.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return the LIG3 similarity of two words.

sim(src: str, tar: str) float

Return the LIG3 similarity of two words.

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

Returns:
float

The LIG3 similarity

Examples

>>> cmp = LIG3()
>>> cmp.sim('cat', 'hat')
0.8
>>> cmp.sim('Niall', 'Neil')
0.5714285714285714
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0

Added in version 0.4.1.

_ncd_bz2

abydos.distance._ncd_bz2.

NCD using bzip2

class distances._ncd_bz2.NCDbz2(level: int = 9, **kwargs: Any)

Bases: _Distance

Normalized Compression Distance using bzip2 compression.

Cf. https://en.wikipedia.org/wiki/Bzip2

Normalized compression distance (NCD) :cite:`Cilibrasi:2005`.

Added in version 0.3.6.
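
As a rough illustration, NCD is commonly defined as (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C gives the compressed size. The sketch below applies that definition with Python's standard `bz2` module. The function name `ncd_bz2` is hypothetical, and the library may treat compressor headers or concatenation order differently, so this sketch is not expected to reproduce the documented example values.

```python
import bz2


def ncd_bz2(src: str, tar: str, level: int = 9) -> float:
    """Normalized compression distance using bzip2-compressed sizes."""

    def size(data: bytes) -> int:
        return len(bz2.compress(data, level))

    src_b, tar_b = src.encode('utf-8'), tar.encode('utf-8')
    c_src, c_tar = size(src_b), size(tar_b)
    c_concat = size(src_b + tar_b)
    # Even empty input compresses to a non-empty stream, so max(...) > 0
    return (c_concat - min(c_src, c_tar)) / max(c_src, c_tar)
```

On very short strings the fixed compressor overhead dominates, which is why small distances result even for unrelated words.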

Methods

dist(src, tar)

Return the NCD between two strings using bzip2 compression.

dist_abs(src, tar)

Return absolute distance.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return similarity.

__init__(level: int = 9, **kwargs: Any) None

Initialize bzip2 compressor.

Parameters:
levelint

The compression level (1 to 9)

Added in version 0.3.6.

Changed in version 0.3.6: Encapsulated in class

dist(src: str, tar: str) float

Return the NCD between two strings using bzip2 compression.

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

Returns:
float

Compression distance

Examples

>>> cmp = NCDbz2()
>>> cmp.dist('cat', 'hat')
0.06666666666666667
>>> cmp.dist('Niall', 'Neil')
0.03125
>>> cmp.dist('aluminum', 'Catalan')
0.17647058823529413
>>> cmp.dist('ATCG', 'TAGC')
0.03125

Added in version 0.3.5.

Changed in version 0.3.6: Encapsulated in class

_overlap

abydos.distance._overlap.

Overlap similarity & distance

class distances._overlap.Overlap(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)

Bases: _TokenDistance

Overlap coefficient.

For two sets X and Y, the overlap coefficient :cite:`Szymkiewicz:1934,Simpson:1949`, also called the Szymkiewicz-Simpson coefficient and Simpson’s ecological coexistence coefficient, is

\[sim_{overlap}(X, Y) = \frac{|X \cap Y|}{min(|X|, |Y|)}\]

In 2x2 confusion table terms, where a+b+c+d=n, this is

\[sim_{overlap} = \frac{a}{min(a+b, a+c)}\]

Added in version 0.3.6.
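
A small sketch of the coefficient over padded character bigrams follows. The helper names are hypothetical, and the use of bigrams padded with '$' and '#' is an assumption modelled on the QGrams defaults (qval=2, start_stop='$#').

```python
from collections import Counter


def bigrams(word: str, start_stop: str = '$#') -> Counter:
    """Multiset of 2-grams of the word, padded with start/stop symbols."""
    padded = start_stop[0] + word + start_stop[1]
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def overlap(src: str, tar: str) -> float:
    """|X ∩ Y| / min(|X|, |Y|) over bigram multisets."""
    x, y = bigrams(src), bigrams(tar)
    shared = sum((x & y).values())  # multiset intersection size
    smaller = min(sum(x.values()), sum(y.values()))  # at least 1 after padding
    return shared / smaller
```

'cat' and 'hat' each yield four bigrams and share 'at' and 't#', so the coefficient is 2 / 4 = 0.5.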

Methods

dist(src, tar)

Return distance.

dist_abs(src, tar)

Return absolute distance.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return the overlap coefficient of two strings.

__init__(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) None

Initialize Overlap instance.

Parameters:
tokenizer_Tokenizer

A tokenizer instance from the abydos.tokenizer package

intersection_typestr

Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.

**kwargs

Arbitrary keyword arguments

Other Parameters:
qvalint

The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.

metric_Distance

A string distance measure class for use in the soft and fuzzy variants.

thresholdfloat

A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.

Added in version 0.4.0.

sim(src: str, tar: str) float

Return the overlap coefficient of two strings.

Parameters:
srcstr

Source string (or QGrams/Counter objects) for comparison

tarstr

Target string (or QGrams/Counter objects) for comparison

Returns:
float

Overlap similarity

Examples

>>> cmp = Overlap()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.4
>>> cmp.sim('aluminum', 'Catalan')
0.125
>>> cmp.sim('ATCG', 'TAGC')
0.0

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

_pearson_chi_squared

abydos.distance._pearson_chi_squared.

Pearson’s Chi-Squared similarity

class distances._pearson_chi_squared.PearsonChiSquared(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)

Bases: _TokenDistance

Pearson’s Chi-Squared similarity.

For two sets X and Y and a population N, the Pearson’s \(\chi^2\) similarity :cite:`Pearson:1913` is

\[sim_{PearsonChiSquared}(X, Y) = \frac{|N| \cdot (|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|)^2} {|X| \cdot |Y| \cdot |N \setminus X| \cdot |N \setminus Y|}\]

This is also Pearson I similarity.

In 2x2 confusion table terms, where a+b+c+d=n, this is

\[sim_{PearsonChiSquared} = \frac{n(ad-bc)^2}{(a+b)(a+c)(b+d)(c+d)}\]

Added in version 0.4.0.
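
The confusion-table form can be evaluated directly. In this sketch the function name `pearson_chi_squared` is hypothetical and returning 0.0 for a zero denominator is an assumption.

```python
def pearson_chi_squared(a: float, b: float, c: float, d: float) -> float:
    """Pearson's chi-squared from 2x2 confusion-table counts."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # assumption: degenerate tables map to 0.0
    return n * (a * d - b * c) ** 2 / denom
```

With hypothetical counts a = b = c = 2 and d = 778 (n = 784), ad - bc = 1552 and the statistic is 784 · 1552² / (4 · 4 · 780 · 780), about 193.995.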

Methods

corr(src, tar)

Return Pearson's Chi-Squared correlation of two strings.

dist(src, tar)

Return distance.

dist_abs(src, tar)

Return absolute distance.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return Pearson's normalized Chi-Squared similarity of two strings.

sim_score(src, tar)

Return Pearson's Chi-Squared similarity of two strings.

__init__(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) None

Initialize PearsonChiSquared instance.

Parameters:
alphabetCounter, collection, int, or None

This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.

tokenizer_Tokenizer

A tokenizer instance from the abydos.tokenizer package

intersection_typestr

Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.

**kwargs

Arbitrary keyword arguments

Other Parameters:
qvalint

The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.

metric_Distance

A string distance measure class for use in the soft and fuzzy variants.

thresholdfloat

A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.

Added in version 0.4.0.

corr(src: str, tar: str) float

Return Pearson’s Chi-Squared correlation of two strings.

Parameters:
srcstr

Source string (or QGrams/Counter objects) for comparison

tarstr

Target string (or QGrams/Counter objects) for comparison

Returns:
float

Pearson’s Chi-Squared correlation

Examples

>>> cmp = PearsonChiSquared()
>>> cmp.corr('cat', 'hat')
0.2474424720578567
>>> cmp.corr('Niall', 'Neil')
0.1300991207720222
>>> cmp.corr('aluminum', 'Catalan')
0.011710186806836291
>>> cmp.corr('ATCG', 'TAGC')
-4.1196952743799446e-05

Added in version 0.4.0.

sim(src: str, tar: str) float

Return Pearson’s normalized Chi-Squared similarity of two strings.

Parameters:
srcstr

Source string (or QGrams/Counter objects) for comparison

tarstr

Target string (or QGrams/Counter objects) for comparison

Returns:
float

Normalized Pearson’s Chi-Squared similarity

Examples

>>> cmp = PearsonChiSquared()
>>> cmp.corr('cat', 'hat')
0.2474424720578567
>>> cmp.corr('Niall', 'Neil')
0.1300991207720222
>>> cmp.corr('aluminum', 'Catalan')
0.011710186806836291
>>> cmp.corr('ATCG', 'TAGC')
-4.1196952743799446e-05

Added in version 0.4.0.

sim_score(src: str, tar: str) float

Return Pearson’s Chi-Squared similarity of two strings.

Parameters:
srcstr

Source string (or QGrams/Counter objects) for comparison

tarstr

Target string (or QGrams/Counter objects) for comparison

Returns:
float

Pearson’s Chi-Squared similarity

Examples

>>> cmp = PearsonChiSquared()
>>> cmp.sim_score('cat', 'hat')
193.99489809335964
>>> cmp.sim_score('Niall', 'Neil')
101.99771068526542
>>> cmp.sim_score('aluminum', 'Catalan')
9.19249664336649
>>> cmp.sim_score('ATCG', 'TAGC')
0.032298410951138765

Added in version 0.4.0.

_pearson_ii

abydos.distance._pearson_ii.

Pearson II similarity

class distances._pearson_ii.PearsonII(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)

Bases: PearsonChiSquared

Pearson II similarity.

For two sets X and Y and a population N, the Pearson II similarity :cite:`Pearson:1913`, Pearson’s coefficient of mean square contingency, is

\[corr_{PearsonII} = \sqrt{\frac{\chi^2}{|N|+\chi^2}}\]

where

\[\chi^2 = sim_{PearsonChiSquared}(X, Y) = \frac{|N| \cdot (|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|)^2} {|X| \cdot |Y| \cdot |N \setminus X| \cdot |N \setminus Y|}\]

In 2x2 confusion table terms, where a+b+c+d=n, this is

\[\chi^2 = sim_{PearsonChiSquared} = \frac{n \cdot (ad-bc)^2}{(a+b)(a+c)(b+d)(c+d)}\]

Added in version 0.4.0.
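
A minimal sketch of the coefficient in confusion-table terms follows. The function name `pearson_ii` is hypothetical and the zero-denominator handling is an assumption.

```python
from math import sqrt


def pearson_ii(a: float, b: float, c: float, d: float) -> float:
    """Coefficient of mean square contingency from 2x2 counts."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if n == 0 or denom == 0:
        return 0.0  # assumption: degenerate tables map to 0.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    return sqrt(chi2 / (n + chi2))
```

For a 2x2 table the value cannot exceed sqrt(1/2), reached when b = c = 0 and a = d. This bound is consistent with the documented sim values being sqrt(2) times the corresponding sim_score values.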

Methods

corr(src, tar)

Return Pearson's Chi-Squared correlation of two strings.

dist(src, tar)

Return distance.

dist_abs(src, tar)

Return absolute distance.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return the normalized Pearson II similarity of two strings.

sim_score(src, tar)

Return the Pearson II similarity of two strings.

__init__(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) None

Initialize PearsonII instance.

Parameters:
alphabetCounter, collection, int, or None

This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.

tokenizer_Tokenizer

A tokenizer instance from the abydos.tokenizer package

intersection_typestr

Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.

**kwargs

Arbitrary keyword arguments

Other Parameters:
qvalint

The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.

metric_Distance

A string distance measure class for use in the soft and fuzzy variants.

thresholdfloat

A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.

Added in version 0.4.0.

sim(src: str, tar: str) float

Return the normalized Pearson II similarity of two strings.

Parameters:
srcstr

Source string (or QGrams/Counter objects) for comparison

tarstr

Target string (or QGrams/Counter objects) for comparison

Returns:
float

Normalized Pearson II similarity

Examples

>>> cmp = PearsonII()
>>> cmp.sim('cat', 'hat')
0.6298568508557214
>>> cmp.sim('Niall', 'Neil')
0.47983719547968123
>>> cmp.sim('aluminum', 'Catalan')
0.15214891090821628
>>> cmp.sim('ATCG', 'TAGC')
0.009076921903905551

Added in version 0.4.0.

sim_score(src: str, tar: str) float

Return the Pearson II similarity of two strings.

Parameters:
srcstr

Source string (or QGrams/Counter objects) for comparison

tarstr

Target string (or QGrams/Counter objects) for comparison

Returns:
float

Pearson II similarity

Examples

>>> cmp = PearsonII()
>>> cmp.sim_score('cat', 'hat')
0.44537605041688455
>>> cmp.sim_score('Niall', 'Neil')
0.3392961347892176
>>> cmp.sim_score('aluminum', 'Catalan')
0.10758552665334761
>>> cmp.sim_score('ATCG', 'TAGC')
0.006418353030552324

Added in version 0.4.0.

_phonetic

abydos.phonetic._phonetic.

The phonetic._phonetic module implements abstract class Phonetic.

_phonetic_distance

abydos.distance._phonetic_distance.

Phonetic distance.

class distances._phonetic_distance.PhoneticDistance(transforms: Type[_Phonetic] | _Phonetic | Callable[[str], str] | Sequence[Type[_Phonetic] | _Phonetic | Callable[[str], str]] | None = None, metric: Type[_Distance] | _Distance | None = None, encode_alpha: bool = False, **kwargs: Any)

Bases: _Distance

Phonetic distance.

Phonetic distance applies one or more supplied string transformations to words and compares the resulting transformed strings using a supplied distance measure.

A simple example would be to create a ‘Soundex distance’:

>>> from abydos.phonetic import Soundex
>>> soundex = PhoneticDistance(transforms=Soundex())
>>> soundex.dist('Ashcraft', 'Ashcroft')
0.0
>>> soundex.dist('Robert', 'Ashcraft')
1.0

Added in version 0.4.1.
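
The transform-then-compare idea can be sketched without the library. Everything below is hypothetical: `phonetic_dist` mirrors only the general shape of the class, and `consonant_skeleton` is a toy transform used in place of a real phonetic algorithm such as Soundex.

```python
from typing import Callable, Optional, Sequence


def consonant_skeleton(word: str) -> str:
    """Toy encoder: keep the first letter, drop later vowels and h/w/y."""
    word = word.lower()
    return word[:1] + ''.join(ch for ch in word[1:] if ch not in 'aeiouyhw')


def phonetic_dist(
    src: str,
    tar: str,
    transforms: Sequence[Callable[[str], str]] = (),
    metric: Optional[Callable[[str, str], float]] = None,
) -> float:
    """Apply each transform to both strings, then compare the results."""
    for transform in transforms:
        src, tar = transform(src), transform(tar)
    if metric is None:
        # Identity comparison: 0.0 if the transformed strings match, else 1.0
        return 0.0 if src == tar else 1.0
    return metric(src, tar)
```

With the toy transform, 'Ashcraft' and 'Ashcroft' both reduce to 'ascrft', so their distance is 0.0 under the identity comparison.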

Methods

dist(src, tar)

Return the normalized Phonetic distance.

dist_abs(src, tar)

Return the Phonetic distance.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return similarity.

__init__(transforms: Type[_Phonetic] | _Phonetic | Callable[[str], str] | Sequence[Type[_Phonetic] | _Phonetic | Callable[[str], str]] | None = None, metric: Type[_Distance] | _Distance | None = None, encode_alpha: bool = False, **kwargs: Any) None

Initialize PhoneticDistance instance.

Parameters:
transformslist or _Phonetic or _Stemmer or _Fingerprint or type

An instance of a subclass of _Phonetic, _Stemmer, or _Fingerprint, or a list (or other iterable) of such instances to apply to each input word before computing their distance or similarity. If omitted, no transformations will be performed.

metric_Distance or type

An instance of a subclass of _Distance, used for computing the inputs’ distance or similarity after being transformed. If omitted, the strings will be compared for identity (returning 0.0 if identical, otherwise 1.0, when distance is computed).

encode_alphabool

Set to True to use the encode_alpha method of phonetic algorithms whenever possible.

**kwargs

Arbitrary keyword arguments

Added in version 0.4.1.

dist(src: str, tar: str) float

Return the normalized Phonetic distance.

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

Returns:
float

The normalized Phonetic distance

Examples

>>> from abydos.phonetic import Soundex
>>> cmp = PhoneticDistance(Soundex())
>>> cmp.dist('cat', 'hat')
1.0
>>> cmp.dist('Niall', 'Neil')
0.0
>>> cmp.dist('Colin', 'Cuilen')
0.0
>>> cmp.dist('ATCG', 'TAGC')
1.0
>>> from abydos.distance import Levenshtein
>>> cmp = PhoneticDistance(transforms=[Soundex], metric=Levenshtein)
>>> cmp.dist('cat', 'hat')
0.25
>>> cmp.dist('Niall', 'Neil')
0.0
>>> cmp.dist('Colin', 'Cuilen')
0.0
>>> cmp.dist('ATCG', 'TAGC')
0.75

Added in version 0.4.1.

dist_abs(src: str, tar: str) float

Return the Phonetic distance.

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

Returns:
float or int

The Phonetic distance

Examples

>>> from abydos.phonetic import Soundex
>>> cmp = PhoneticDistance(Soundex())
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
0
>>> cmp.dist_abs('Colin', 'Cuilen')
0
>>> cmp.dist_abs('ATCG', 'TAGC')
1
>>> from abydos.distance import Levenshtein
>>> cmp = PhoneticDistance(transforms=[Soundex], metric=Levenshtein)
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
0
>>> cmp.dist_abs('Colin', 'Cuilen')
0
>>> cmp.dist_abs('ATCG', 'TAGC')
3

Added in version 0.4.1.
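The transform-then-compare behaviour described for PhoneticDistance can be sketched in plain Python. This is a minimal illustration, not the library's implementation: phonetic_dist and consonant_key are hypothetical names, and consonant_key is a crude stand-in for a real phonetic encoder such as Soundex.

```python
from typing import Callable, Iterable, Optional


def consonant_key(word: str) -> str:
    """Hypothetical stand-in for a phonetic encoder (this is not Soundex)."""
    word = word.upper()
    # Keep the first letter, then drop vowels and the letters H, W and Y.
    return word[:1] + ''.join(ch for ch in word[1:] if ch not in 'AEIOUHWY')


def phonetic_dist(
    src: str,
    tar: str,
    transforms: Iterable[Callable[[str], str]] = (),
    metric: Optional[Callable[[str, str], float]] = None,
) -> float:
    """Apply each transform to both words, then compare the results."""
    for transform in transforms:
        src, tar = transform(src), transform(tar)
    if metric is None:
        # No metric supplied: compare the transformed strings for identity.
        return 0.0 if src == tar else 1.0
    return metric(src, tar)
```

With this stand-in encoder, 'Ashcraft' and 'Ashcroft' both reduce to the same key and so compare as identical, while 'Robert' and 'Ashcraft' do not.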

_q_grams

abydos.tokenizer._q_grams.

QGrams multi-set class

class distances._q_grams.QGrams(qval: int | Iterable[int] = 2, start_stop: str = '$#', skip: int | Iterable[int] = 0, scaler: str | Callable[[float], float] | None = None)

Bases: _Tokenizer

A q-gram class, which functions like a bag/multiset.

A q-gram is here defined as all sequences of q characters. Q-grams are also known as k-grams and n-grams, but the term n-gram more typically refers to sequences of whitespace-delimited words in a string, where q-gram refers to sequences of characters in a word or string.

Added in version 0.1.0.

Methods

count()

Return token count.

count_unique()

Return the number of unique elements.

get_counter()

Return the tokens as a Counter object.

get_list()

Return the tokens as an ordered list.

get_set()

Return the unique tokens as a set.

tokenize(string)

Tokenize the term and store it.

__init__(qval: int | Iterable[int] = 2, start_stop: str = '$#', skip: int | Iterable[int] = 0, scaler: str | Callable[[float], float] | None = None) None

Initialize QGrams.

Parameters:
qvalint or Iterable

The q-gram length (defaults to 2), can be an integer, range object, or list

start_stopstr

A string of length >= 0 indicating start & stop symbols. If the string is ‘’, q-grams will be calculated without start & stop symbols appended to each end. Otherwise, the first character of start_stop will pad the beginning of the string and the last character of start_stop will pad the end of the string before q-grams are calculated. (In the case that start_stop is only 1 character long, the same symbol will be used for both.)

skipint or Iterable

The number of characters to skip, can be an integer, range object, or list

scalerNone, str, or function

A scaling function for the Counter:

  • None : no scaling

  • ‘set’ : All non-zero values are set to 1.

  • ‘length’ : Each token has weight equal to its length.

  • ‘length-log’ : Each token has weight equal to the log of its length + 1.

  • ‘length-exp’ : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

Raises:
ValueError

Use WhitespaceTokenizer instead of qval=0.

Examples

>>> qg = QGrams().tokenize('AATTATAT')
>>> qg
QGrams({'$A': 1, 'AA': 1, 'AT': 3, 'TT': 1, 'TA': 2, 'T#': 1})
>>> qg = QGrams(qval=1, start_stop='').tokenize('AATTATAT')
>>> qg
QGrams({'A': 4, 'T': 4})
>>> qg = QGrams(qval=3, start_stop='').tokenize('AATTATAT')
>>> qg
QGrams({'AAT': 1, 'ATT': 1, 'TTA': 1, 'TAT': 2, 'ATA': 1})
>>> QGrams(qval=2, start_stop='$#').tokenize('interning')
QGrams({'$i': 1, 'in': 2, 'nt': 1, 'te': 1, 'er': 1, 'rn': 1, 'ni': 1,
'ng': 1, 'g#': 1})
>>> QGrams(start_stop='', skip=1).tokenize('AACTAGAAC')
QGrams({'AC': 2, 'AT': 1, 'CA': 1, 'TG': 1, 'AA': 1, 'GA': 1, 'A': 1})
>>> QGrams(start_stop='', skip=[0, 1]).tokenize('AACTAGAAC')
QGrams({'AA': 3, 'AC': 4, 'CT': 1, 'TA': 1, 'AG': 1, 'GA': 2, 'AT': 1,
'CA': 1, 'TG': 1, 'A': 1})
>>> QGrams(qval=range(3), skip=[0, 1]).tokenize('interdisciplinarian')
QGrams({'i': 10, 'n': 7, 't': 2, 'e': 2, 'r': 4, 'd': 2, 's': 2,
'c': 2, 'p': 2, 'l': 2, 'a': 4, '$i': 1, 'in': 3, 'nt': 1, 'te': 1,
'er': 1, 'rd': 1, 'di': 1, 'is': 1, 'sc': 1, 'ci': 1, 'ip': 1, 'pl': 1,
'li': 1, 'na': 1, 'ar': 1, 'ri': 2, 'ia': 2, 'an': 1, 'n#': 1, '$n': 1,
'it': 1, 'ne': 1, 'tr': 1, 'ed': 1, 'ds': 1, 'ic': 1, 'si': 1, 'cp': 1,
'il': 1, 'pi': 1, 'ln': 1, 'nr': 1, 'ai': 1, 'ra': 1, 'a#': 1})

Added in version 0.1.0.

Changed in version 0.4.0: Broke tokenization functions out into tokenize method
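The padding and windowing rules described above can be sketched without the library. This is a simplified illustration for a single integer qval and no skip; qgram_counts is a hypothetical helper, not part of abydos.

```python
from collections import Counter


def qgram_counts(string: str, qval: int = 2, start_stop: str = '$#') -> Counter:
    """Count contiguous q-grams, padding with start/stop symbols if given."""
    if qval < 1:
        raise ValueError('qval must be at least 1')
    if start_stop and qval > 1:
        # Pad with q-1 copies of the first and last start_stop characters.
        string = (
            start_stop[0] * (qval - 1) + string + start_stop[-1] * (qval - 1)
        )
    # Slide a window of width qval across the (padded) string.
    return Counter(
        string[i:i + qval] for i in range(len(string) - qval + 1)
    )
```

For 'AATTATAT' this produces the same counts as the first three examples above.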

tokenize(string: str) QGrams

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters:
stringstr

The string to tokenize

Added in version 0.4.0.

_q_skipgrams

abydos.tokenizer._q_skipgrams.

Q-Skipgrams multi-set class

class distances._q_skipgrams.QSkipgrams(qval: int | Iterable[int] = 2, start_stop: str = '$#', scaler: str | Callable[[float], float] | None = None, ssk_lambda: float | Iterable[float] = 0.9)

Bases: _Tokenizer

A q-skipgram class, which functions like a bag/multiset.

A q-gram is here defined as all sequences of q characters. Q-grams are also known as k-grams and n-grams, but the term n-gram more typically refers to sequences of whitespace-delimited words in a string, where q-gram refers to sequences of characters in a word or string.

Added in version 0.4.0.

Methods

count()

Return token count.

count_unique()

Return the number of unique elements.

get_counter()

Return the tokens as a Counter object.

get_list()

Return the tokens as an ordered list.

get_set()

Return the unique tokens as a set.

tokenize(string)

Tokenize the term and store it.

__init__(qval: int | Iterable[int] = 2, start_stop: str = '$#', scaler: str | Callable[[float], float] | None = None, ssk_lambda: float | Iterable[float] = 0.9) None

Initialize QSkipgrams.

Parameters:
qvalint or Iterable

The q-gram length (defaults to 2), can be an integer, range object, or list

start_stopstr

A string of length >= 0 indicating start & stop symbols. If the string is ‘’, q-grams will be calculated without start & stop symbols appended to each end. Otherwise, the first character of start_stop will pad the beginning of the string and the last character of start_stop will pad the end of the string before q-grams are calculated. (In the case that start_stop is only 1 character long, the same symbol will be used for both.)

scalerNone, str, or function

A scaling function for the Counter:

  • None : no scaling

  • ‘set’ : All non-zero values are set to 1.

  • ‘length’ : Each token has weight equal to its length.

  • ‘length-log’ : Each token has weight equal to the log of its length + 1.

  • ‘length-exp’ : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

  • ‘SSK’ : Applies weighting according to the substring kernel rules of :cite:`Lodhi:2002`.

ssk_lambdafloat or Iterable

A value in the range (0.0, 1.0) used for discounting gaps between characters according to the method described in :cite:`Lodhi:2002`. To supply multiple values of lambda, provide an Iterable of numeric values, such as (0.5, 0.05) or np.arange(0.05, 0.5, 0.05).

Raises:
ValueError

Use WhitespaceTokenizer instead of qval=0.

Examples

>>> QSkipgrams().tokenize('AATTAT')
QSkipgrams({'$A': 3, '$T': 3, '$#': 1, 'AA': 3, 'AT': 7, 'A#': 3,
'TT': 3, 'TA': 2, 'T#': 3})
>>> QSkipgrams(qval=1, start_stop='').tokenize('AATTAT')
QSkipgrams({'A': 3, 'T': 3})
>>> QSkipgrams(qval=3, start_stop='').tokenize('AATTAT')
QSkipgrams({'AAT': 5, 'AAA': 1, 'ATT': 6, 'ATA': 4, 'TTA': 1, 'TTT': 1,
'TAT': 2})
>>> QSkipgrams(start_stop='').tokenize('ABCD')
QSkipgrams({'AB': 1, 'AC': 1, 'AD': 1, 'BC': 1, 'BD': 1, 'CD': 1})
>>> QSkipgrams().tokenize('Colin')
QSkipgrams({'$C': 1, '$o': 1, '$l': 1, '$i': 1, '$n': 1, '$#': 1,
'Co': 1, 'Cl': 1, 'Ci': 1, 'Cn': 1, 'C#': 1, 'ol': 1, 'oi': 1, 'on': 1,
'o#': 1, 'li': 1, 'ln': 1, 'l#': 1, 'in': 1, 'i#': 1, 'n#': 1})
>>> QSkipgrams(qval=3).tokenize('AACTAGAAC')
QSkipgrams({'$$A': 5, '$$C': 2, '$$T': 1, '$$G': 1, '$$#': 2,
'$AA': 20, '$AC': 14, '$AT': 4, '$AG': 6, '$A#': 20, '$CT': 2,
'$CA': 6, '$CG': 2, '$CC': 2, '$C#': 8, '$TA': 6, '$TG': 2, '$TC': 2,
'$T#': 4, '$GA': 4, '$GC': 2, '$G#': 4, '$##': 2, 'AAC': 11, 'AAT': 1,
'AAA': 10, 'AAG': 3, 'AA#': 20, 'ACT': 2, 'ACA': 6, 'ACG': 2, 'ACC': 2,
'AC#': 14, 'ATA': 6, 'ATG': 2, 'ATC': 2, 'AT#': 4, 'AGA': 6, 'AGC': 3,
'AG#': 6, 'A##': 5, 'CTA': 3, 'CTG': 1, 'CTC': 1, 'CT#': 2, 'CAG': 1,
'CAA': 3, 'CAC': 3, 'CA#': 6, 'CGA': 2, 'CGC': 1, 'CG#': 2, 'CC#': 2,
'C##': 2, 'TAG': 1, 'TAA': 3, 'TAC': 3, 'TA#': 6, 'TGA': 2, 'TGC': 1,
'TG#': 2, 'TC#': 2, 'T##': 1, 'GAA': 1, 'GAC': 2, 'GA#': 4, 'GC#': 2,
'G##': 1})

QSkipgrams may also be used to produce weights in accordance with the substring kernel rules of :cite:`Lodhi:2002` by passing the scaler value 'SSK':

>>> QSkipgrams(scaler='SSK').tokenize('AACTAGAAC')
QSkipgrams(, {'$A': 2.8883286990000006, '$C': 1.0047784401000002,
'$T': 0.5904900000000001, '$G': 0.4782969000000001,
'$#': 0.31381059609000006, 'AA': 6.170192010000001, 'AC': 4.486377699,
'AT': 1.3851, 'AG': 1.931931, 'A#': 2.6526399291000002, 'CT': 0.81,
'CA': 1.850931, 'CG': 0.6561, 'CC': 0.4782969000000001,
'C#': 1.2404672100000003, 'TA': 2.05659, 'TG': 0.7290000000000001,
'TC': 0.531441, 'T#': 0.4782969000000001, 'GA': 1.5390000000000001,
'GC': 0.6561, 'G#': 0.5904900000000001})

Added in version 0.4.0.
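The examples above are consistent with treating each q-skipgram as an ordered selection of q characters from the padded string, and with weighting each occurrence under 'SSK' by lambda raised to the length of the span it covers. The sketch below follows that reading for a single qval and a single lambda; the function names are hypothetical and this is not the library's code.

```python
from collections import Counter, defaultdict
from itertools import combinations


def _pad(string: str, qval: int, start_stop: str) -> str:
    """Pad with q-1 start and stop symbols, as for ordinary q-grams."""
    if start_stop and qval > 1:
        return start_stop[0] * (qval - 1) + string + start_stop[-1] * (qval - 1)
    return string


def skipgram_counts(string: str, qval: int = 2, start_stop: str = '$#') -> Counter:
    """Count every in-order selection of qval characters."""
    padded = _pad(string, qval, start_stop)
    return Counter(''.join(chars) for chars in combinations(padded, qval))


def skipgram_ssk_weights(string, qval=2, start_stop='$#', ssk_lambda=0.9):
    """Weight each occurrence by ssk_lambda ** (span it covers)."""
    padded = _pad(string, qval, start_stop)
    weights = defaultdict(float)
    for idx in combinations(range(len(padded)), qval):
        token = ''.join(padded[i] for i in idx)
        span = idx[-1] - idx[0] + 1  # first to last selected character
        weights[token] += ssk_lambda ** span
    return dict(weights)
```

Under these assumptions, 'CT' in 'AACTAGAAC' occurs once with a span of 2, giving 0.9 ** 2 = 0.81, which agrees with the 'SSK' example above.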

tokenize(string: str) QSkipgrams

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters:
stringstr

The string to tokenize

Added in version 0.4.0.

_ratcliff_obershelp

abydos.distance._ratcliff_obershelp.

Ratcliff-Obershelp similarity

class distances._ratcliff_obershelp.RatcliffObershelp(**kwargs: Any)

Bases: _Distance

Ratcliff-Obershelp similarity.

This follows the Ratcliff-Obershelp algorithm :cite:`Ratcliff:1988` to derive a similarity measure:

  1. Find the length of the longest common substring in src & tar.

  2. Recurse on the strings to the left & right of this substring in src & tar. The base case is a 0 length common substring, in which case, return 0. Otherwise, return the sum of the current longest common substring and the left & right recursed sums.

  3. Multiply this length by 2 and divide by the sum of the lengths of src & tar.

Cf. http://www.drdobbs.com/database/pattern-matching-the-gestalt-approach/184407970

Added in version 0.3.6.
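The three steps above can be sketched directly. This is an illustrative pure-Python version, not the library's implementation; the helper names are hypothetical, and returning 1.0 for two empty strings is an assumption.

```python
def _longest_common_substring(src: str, tar: str):
    """Return (start in src, start in tar, length) of the first longest match."""
    best_len = best_i = best_j = 0
    prev = [0] * (len(tar) + 1)
    for i in range(1, len(src) + 1):
        cur = [0] * (len(tar) + 1)
        for j in range(1, len(tar) + 1):
            if src[i - 1] == tar[j - 1]:
                # Extend the common run ending at the previous characters.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len = cur[j]
                    best_i, best_j = i - best_len, j - best_len
        prev = cur
    return best_i, best_j, best_len


def _matched_chars(src: str, tar: str) -> int:
    """Steps 1 and 2: longest common substring, then recurse left and right."""
    i, j, length = _longest_common_substring(src, tar)
    if length == 0:
        return 0
    return (
        length
        + _matched_chars(src[:i], tar[:j])
        + _matched_chars(src[i + length:], tar[j + length:])
    )


def ratcliff_obershelp_sim(src: str, tar: str) -> float:
    """Step 3: twice the matched length over the summed string lengths."""
    if not src and not tar:
        return 1.0  # assumption: two empty strings are treated as identical
    return 2 * _matched_chars(src, tar) / (len(src) + len(tar))
```

For 'cat' and 'hat' the longest common substring is 'at' and nothing further matches on either side, so the result is 2 * 2 / 6.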

Methods

dist(src, tar)

Return distance.

dist_abs(src, tar)

Return absolute distance.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return the Ratcliff-Obershelp similarity of two strings.

sim(src: str, tar: str) float

Return the Ratcliff-Obershelp similarity of two strings.

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

Returns:
float

Ratcliff-Obershelp similarity

Examples

>>> cmp = RatcliffObershelp()
>>> round(cmp.sim('cat', 'hat'), 12)
0.666666666667
>>> round(cmp.sim('Niall', 'Neil'), 12)
0.666666666667
>>> round(cmp.sim('aluminum', 'Catalan'), 12)
0.4
>>> cmp.sim('ATCG', 'TAGC')
0.5

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

_refined_soundex

abydos.phonetic._refined_soundex.

Refined Soundex

class distances._refined_soundex.RefinedSoundex(max_length: int = -1, zero_pad: bool = False, retain_vowels: bool = False)

Bases: _Phonetic

Refined Soundex.

This is Soundex, but with more character classes. It was defined at :cite:`Boyce:1998`.

Added in version 0.3.6.

Methods

encode(word)

Return the Refined Soundex code for a word.

encode_alpha(word)

Return the alphabetic Refined Soundex code for a word.

__init__(max_length: int = -1, zero_pad: bool = False, retain_vowels: bool = False) None

Initialize RefinedSoundex instance.

Parameters:
max_lengthint

The length of the code returned (defaults to unlimited)

zero_padbool

Pad the end of the return value with 0s to achieve a max_length string

retain_vowelsbool

Retain vowels (as 0) in the resulting code

Added in version 0.4.0.

encode(word: str) str

Return the Refined Soundex code for a word.

Parameters:
wordstr

The word to transform

Returns:
str

The Refined Soundex value

Examples

>>> pe = RefinedSoundex()
>>> pe.encode('Christopher')
'C93619'
>>> pe.encode('Niall')
'N7'
>>> pe.encode('Smith')
'S86'
>>> pe.encode('Schmidt')
'S386'

Added in version 0.3.0.

Changed in version 0.3.6: Encapsulated in class

encode_alpha(word: str) str

Return the alphabetic Refined Soundex code for a word.

Parameters:
wordstr

The word to transform

Returns:
str

The alphabetic Refined Soundex value

Examples

>>> pe = RefinedSoundex()
>>> pe.encode_alpha('Christopher')
'CRKTPR'
>>> pe.encode_alpha('Niall')
'NL'
>>> pe.encode_alpha('Smith')
'SNT'
>>> pe.encode_alpha('Schmidt')
'SKNT'

Added in version 0.4.0.

_regexp

abydos.tokenizer._regexp.

Regexp tokenizer

class distances._regexp.RegexpTokenizer(scaler: str | Callable[[float], float] | None = None, regexp: str = '\\w+', flags: int = 0)

Bases: _Tokenizer

A regexp tokenizer.

Added in version 0.4.0.

Methods

count()

Return token count.

count_unique()

Return the number of unique elements.

get_counter()

Return the tokens as a Counter object.

get_list()

Return the tokens as an ordered list.

get_set()

Return the unique tokens as a set.

tokenize(string)

Tokenize the term and store it.

__init__(scaler: str | Callable[[float], float] | None = None, regexp: str = '\\w+', flags: int = 0) None

Initialize tokenizer.

Parameters:
scalerNone, str, or function

A scaling function for the Counter:

  • None : no scaling

  • ‘set’ : All non-zero values are set to 1.

  • ‘length’ : Each token has weight equal to its length.

  • ‘length-log’ : Each token has weight equal to the log of its length + 1.

  • ‘length-exp’ : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

regexpstr

A regular expression used to match tokens in the input text.

flagsint

Flags to pass to the regular expression matcher. See the documentation on Python’s re module for details.

Added in version 0.4.0.

tokenize(string: str) RegexpTokenizer

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters:
stringstr

The string to tokenize

Examples

>>> RegexpTokenizer(regexp=r'[^-]+').tokenize('AA-CT-AG-AA-CD')
RegexpTokenizer({'AA': 2, 'CT': 1, 'AG': 1, 'CD': 1})

Added in version 0.4.0.
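The same kind of tokenization can be approximated with the standard library alone. The sketch below is a rough equivalent rather than the class's actual code; regexp_tokens is a hypothetical helper, and using r'\S+' to mimic whitespace tokenization is an assumption.

```python
import re
from collections import Counter


def regexp_tokens(string: str, regexp: str = r'\w+', flags: int = 0) -> Counter:
    """Count every non-overlapping match of regexp in string."""
    # finditer with group(0) returns whole matches even if regexp has groups.
    return Counter(m.group(0) for m in re.finditer(regexp, string, flags))
```

Passing regexp=r'[^-]+' reproduces the counts in the example above, and regexp=r'\S+' splits on runs of whitespace.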

_rouge_l

abydos.distance._rouge_l.

Rouge-L similarity

class distances._rouge_l.RougeL(**kwargs: Any)

Bases: _Distance

Rouge-L similarity.

Rouge-L similarity :cite:`Lin:2004`

Added in version 0.4.0.

Methods

dist(src, tar)

Return distance.

dist_abs(src, tar)

Return absolute distance.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar[, beta])

Return the Rouge-L similarity of two strings.

__init__(**kwargs: Any) None

Initialize RougeL instance.

Parameters:
**kwargs

Arbitrary keyword arguments

Added in version 0.4.0.

sim(src: str, tar: str, beta: float = 8) float

Return the Rouge-L similarity of two strings.

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

betaint or float

A weighting factor to prejudice similarity towards src

Returns:
float

Rouge-L similarity

Examples

>>> cmp = RougeL()
>>> cmp.sim('cat', 'hat')
0.6666666666666666
>>> cmp.sim('Niall', 'Neil')
0.6018518518518519
>>> cmp.sim('aluminum', 'Catalan')
0.3757225433526012
>>> cmp.sim('ATCG', 'TAGC')
0.5

Added in version 0.4.0.
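The example values above are consistent with the usual Rouge-L F-measure over the longest common subsequence (LCS), with recall taken relative to src and precision relative to tar. The sketch below follows that reading; the function names are hypothetical, and returning 0.0 when there is no common subsequence is an assumption.

```python
def lcs_length(src: str, tar: str) -> int:
    """Length of the longest common subsequence (classic dynamic programme)."""
    prev = [0] * (len(tar) + 1)
    for s_char in src:
        cur = [0]
        for j, t_char in enumerate(tar, 1):
            if s_char == t_char:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_sim(src: str, tar: str, beta: float = 8) -> float:
    """F-measure of LCS recall (vs. src) and precision (vs. tar)."""
    lcs = lcs_length(src, tar)
    if lcs == 0:
        return 0.0  # assumption: no shared subsequence means zero similarity
    recall = lcs / len(src)
    precision = lcs / len(tar)
    beta_sq = beta * beta
    return (1 + beta_sq) * recall * precision / (recall + beta_sq * precision)
```

For 'Niall' and 'Neil' the LCS is 'Nil', so recall is 3/5 and precision is 3/4; with beta = 8 the result is dominated by the recall term.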

_ssk

abydos.distance._ssk.

String subsequence kernel (SSK) similarity

class distances._ssk.SSK(tokenizer: _Tokenizer | None = None, ssk_lambda: float = 0.9, **kwargs: Any)

Bases: _TokenDistance

String subsequence kernel (SSK) similarity.

This is based on :cite:`Lodhi:2002`.

Added in version 0.4.1.

Methods

dist(src, tar)

Return distance.

dist_abs(src, tar)

Return absolute distance.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return the normalized SSK similarity of two strings.

sim_score(src, tar)

Return the SSK similarity of two strings.

__init__(tokenizer: _Tokenizer | None = None, ssk_lambda: float = 0.9, **kwargs: Any) None

Initialize SSK instance.

Parameters:
tokenizer_Tokenizer

A tokenizer instance from the abydos.tokenizer package

ssk_lambdafloat or Iterable

A value in the range (0.0, 1.0) used for discounting gaps between characters according to the method described in :cite:`Lodhi:2002`. To supply multiple values of lambda, provide an Iterable of numeric values, such as (0.5, 0.05) or np.arange(0.05, 0.5, 0.05).

**kwargs

Arbitrary keyword arguments

Other Parameters:
qvalint

The length of each q-skipgram. Using this parameter and tokenizer=None will cause the instance to use the QSkipgrams tokenizer with this q value.

Added in version 0.4.1.

sim(src: str, tar: str) float

Return the normalized SSK similarity of two strings.

Parameters:
srcstr

Source string (or QGrams/Counter objects) for comparison

tarstr

Target string (or QGrams/Counter objects) for comparison

Returns:
float

Normalized string subsequence kernel similarity

Examples

>>> cmp = SSK()
>>> cmp.sim('cat', 'hat')
0.3558718861209964
>>> cmp.sim('Niall', 'Neil')
0.4709007822130597
>>> cmp.sim('aluminum', 'Catalan')
0.13760157193822603
>>> cmp.sim('ATCG', 'TAGC')
0.6140899528060498

Added in version 0.4.1.
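The normalized values above can be reproduced by a cosine-style normalization of SSK-weighted 2-skipgram vectors built without start and stop symbols. The sketch below is a reconstruction under those assumptions, not the library's code; ssk_weights and ssk_sim are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def ssk_weights(string: str, qval: int = 2, ssk_lambda: float = 0.9) -> dict:
    """Sum ssk_lambda ** span over every in-order selection of qval chars."""
    weights = defaultdict(float)
    for idx in combinations(range(len(string)), qval):
        token = ''.join(string[i] for i in idx)
        weights[token] += ssk_lambda ** (idx[-1] - idx[0] + 1)
    return dict(weights)


def ssk_sim(src: str, tar: str, qval: int = 2, ssk_lambda: float = 0.9) -> float:
    """Kernel value divided by the geometric mean of the self-kernels."""
    s_vec = ssk_weights(src, qval, ssk_lambda)
    t_vec = ssk_weights(tar, qval, ssk_lambda)
    dot = sum(w * t_vec[tok] for tok, w in s_vec.items() if tok in t_vec)
    norm = sqrt(
        sum(w * w for w in s_vec.values()) * sum(w * w for w in t_vec.values())
    )
    # Assumption: strings too short to yield any skipgram score 0.0.
    return dot / norm if norm else 0.0
```

For 'cat' and 'hat' the only shared skipgram is 'at' with weight 0.81, giving 0.6561 / 1.843641.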

sim_score(src: str, tar: str) float

Return the SSK similarity of two strings.

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

Returns:
float

String subsequence kernel similarity

Examples

>>> cmp = SSK()
>>> cmp.dist_abs('cat', 'hat')
0.6441281138790036
>>> cmp.dist_abs('Niall', 'Neil')
0.5290992177869402
>>> cmp.dist_abs('aluminum', 'Catalan')
0.862398428061774
>>> cmp.dist_abs('ATCG', 'TAGC')
0.38591004719395017

Added in version 0.4.1.

_tichy

abydos.distance._tichy.

Tichy edit distance

class distances._tichy.Tichy(cost: Tuple[int, int] = (1, 1), **kwargs: Any)

Bases: _Distance

Tichy edit distance.

Tichy described an algorithm, implemented below, in :cite:`Tichy:1984`. Following this, :cite:`Cormode:2003` identifies an interpretation of this algorithm’s output as a distance measure, which is largely followed by the methods below.

Tichy’s algorithm locates substrings of a string S to be copied in order to create a string T. The only other operations used by his algorithm for string reconstruction are add operations.

Methods

dist(src, tar)

Return the normalized Tichy edit distance between two strings.

dist_abs(src, tar)

Return the Tichy distance between two strings.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return similarity.

Notes

While :cite:`Cormode:2003` counts only move operations to calculate distance, I give the option (enabled by default) of counting add operations as part of the distance measure. To ignore the cost of add operations, set the cost value to (1, 0), for example, when initializing the object. Further, in the case that S and T are identical, a distance of 0 will be returned, even though this would still be counted as a single move operation spanning the whole of string S.

Added in version 0.4.0.

__init__(cost: Tuple[int, int] = (1, 1), **kwargs: Any) None

Initialize Tichy instance.

Parameters:
costtuple

A 2-tuple representing the cost of the two possible edits: block moves and adds (by default: (1, 1))

**kwargs

Arbitrary keyword arguments

Added in version 0.4.0.

dist(src: str, tar: str) float

Return the normalized Tichy edit distance between two strings.

The Tichy distance is normalized by dividing the distance by the length of the tar string.

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

Returns:
float

The normalized Tichy distance between src & tar

Examples

>>> cmp = Tichy()
>>> round(cmp.dist('cat', 'hat'), 12)
0.666666666667
>>> round(cmp.dist('Niall', 'Neil'), 12)
1.0
>>> cmp.dist('aluminum', 'Catalan')
0.8571428571428571
>>> cmp.dist('ATCG', 'TAGC')
1.0

Added in version 0.4.0.

dist_abs(src: str, tar: str) float

Return the Tichy distance between two strings.

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

Returns:
int (may return a float if cost has float values)

The Tichy distance between src & tar

Examples

>>> cmp = Tichy()
>>> cmp.dist_abs('cat', 'hat')
2
>>> cmp.dist_abs('Niall', 'Neil')
4
>>> cmp.dist_abs('aluminum', 'Catalan')
6
>>> cmp.dist_abs('ATCG', 'TAGC')
4

Added in version 0.4.0.
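The block-move and add accounting described above can be sketched with a greedy scan of the target string. This is a simplified illustration that reproduces the example values, not the library's implementation; the function names are hypothetical, and the normalization shown assumes unit costs.

```python
from typing import Tuple


def tichy_dist_abs(src: str, tar: str, cost: Tuple[float, float] = (1, 1)) -> float:
    """Cost of rebuilding tar from block moves out of src plus adds."""
    if src == tar:
        return 0  # identical strings are defined to have distance 0
    move_cost, add_cost = cost
    total = 0
    pos = 0
    while pos < len(tar):
        # Find the longest prefix of tar[pos:] that occurs anywhere in src.
        length = 0
        while pos + length < len(tar) and tar[pos:pos + length + 1] in src:
            length += 1
        if length:
            total += move_cost  # copy the whole block with one move
            pos += length
        else:
            total += add_cost  # character absent from src: add it
            pos += 1
    return total


def tichy_dist(src: str, tar: str) -> float:
    """Normalize by the length of tar (unit costs assumed)."""
    if not tar:
        return 0.0
    return tichy_dist_abs(src, tar) / len(tar)
```

For 'cat' and 'hat', 'h' must be added and 'at' is copied in one move, for a distance of 2. Passing cost=(1, 0) ignores the add and leaves a distance of 1.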

_tokenizer

abydos.tokenizer._tokenizer.

_Tokenizer base class

_token_distance

abydos.distance._token_distance.

The distance._token_distance module implements the abstract class _TokenDistance.

_typo

abydos.distance._typo.

Typo edit distance functions.

class distances._typo.Typo(metric: str = 'euclidean', cost: Tuple[float, float, float, float] = (1.0, 1.0, 0.5, 0.5), layout: str = 'QWERTY', failsafe: bool = False, **kwargs: Any)

Bases: _Distance

Typo distance.

This is inspired by Typo-Distance :cite:`Song:2011`, and a fair bit of this was copied from that module. Compared to the original, this supports different metrics for substitution.

Added in version 0.3.6.

Methods

dist(src, tar)

Return the normalized typo distance between two strings.

dist_abs(src, tar)

Return the typo distance between two strings.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return similarity.

__init__(metric: str = 'euclidean', cost: Tuple[float, float, float, float] = (1.0, 1.0, 0.5, 0.5), layout: str = 'QWERTY', failsafe: bool = False, **kwargs: Any)

Initialize Typo instance.

Parameters:
metricstr

Supported values include: euclidean, manhattan, log-euclidean, and log-manhattan

costtuple

A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and shifts, respectively (by default: (1, 1, 0.5, 0.5)). The substitution & shift costs should be significantly less than the insertion & deletion costs unless a log metric is used.

layoutstr

Name of the keyboard layout to use (Currently supported: QWERTY, Dvorak, AZERTY, QWERTZ, auto). If auto is selected, the class will attempt to determine an appropriate keyboard based on the supplied words.

failsafebool

If True, substitution of an unknown character (one not present on the selected keyboard) will incur a cost equal to an insertion plus a deletion.

**kwargs

Arbitrary keyword arguments

Added in version 0.4.0.

dist(src: str, tar: str) float

Return the normalized typo distance between two strings.

This is typo distance, normalized to [0, 1].

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

Returns:
float

Normalized typo distance

Examples

>>> cmp = Typo()
>>> round(cmp.dist('cat', 'hat'), 12)
0.527046276695
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.565028153987
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.569035593729
>>> cmp.dist('ATCG', 'TAGC')
0.625

Added in version 0.3.0.

Changed in version 0.3.6: Encapsulated in class

dist_abs(src: str, tar: str) float

Return the typo distance between two strings.

Parameters:
srcstr

Source string for comparison

tarstr

Target string for comparison

Returns:
float

Typo distance

Raises:
ValueError

char not found in any keyboard layouts

Examples

>>> cmp = Typo()
>>> cmp.dist_abs('cat', 'hat')
1.5811388300841898
>>> cmp.dist_abs('Niall', 'Neil')
2.8251407699364424
>>> cmp.dist_abs('Colin', 'Cuilen')
3.414213562373095
>>> cmp.dist_abs('ATCG', 'TAGC')
2.5
>>> cmp = Typo(metric='manhattan')
>>> cmp.dist_abs('cat', 'hat')
2.0
>>> cmp.dist_abs('Niall', 'Neil')
3.0
>>> cmp.dist_abs('Colin', 'Cuilen')
3.5
>>> cmp.dist_abs('ATCG', 'TAGC')
2.5
>>> cmp = Typo(metric='log-manhattan')
>>> cmp.dist_abs('cat', 'hat')
0.8047189562170501
>>> cmp.dist_abs('Niall', 'Neil')
2.2424533248940004
>>> cmp.dist_abs('Colin', 'Cuilen')
2.242453324894
>>> cmp.dist_abs('ATCG', 'TAGC')
2.3465735902799727

Added in version 0.3.0.

Changed in version 0.3.6: Encapsulated in class

_warrens_iv

abydos.distance._warrens_iv.

Warrens IV similarity

class distances._warrens_iv.WarrensIV(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)

Bases: _TokenDistance

Warrens IV similarity.

For two sets X and Y and a population N, Warrens IV similarity :cite:`Warrens:2008` is

\[sim_{WarrensIV}(X, Y) = \frac{4|X \cap Y| \cdot |(N \setminus X) \setminus Y|} {4|X \cap Y| \cdot |(N \setminus X) \setminus Y| + (|X \cap Y| + |(N \setminus X) \setminus Y|) (|X \setminus Y| + |Y \setminus X|)}\]

In 2x2 confusion table terms, where a+b+c+d=n, this is

\[sim_{WarrensIV} = \frac{4ad}{4ad + (a+d)(b+c)}\]

Added in version 0.4.0.
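The confusion-table form of the formula is easy to evaluate directly. The helper below is a hypothetical sketch that takes the four cell counts as arguments; returning 0.0 when the denominator is zero is an assumption.

```python
def warrens_iv_sim(a: float, b: float, c: float, d: float) -> float:
    """Evaluate 4ad / (4ad + (a + d)(b + c)) from 2x2 table counts."""
    numerator = 4 * a * d
    denominator = numerator + (a + d) * (b + c)
    # Assumption: an empty table (zero denominator) scores 0.0.
    return numerator / denominator if denominator else 0.0
```

As a worked case, the padded bigrams of 'cat' and 'hat' share two tokens and differ in two each, so a = b = c = 2. Taking d = 778 (an assumed population of 784 tokens, chosen here because it matches the documented result) gives 6224 / 9344, or about 0.6661. When the strings share no tokens, a = 0 and the similarity is 0.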

Methods

dist(src, tar)

Return distance.

dist_abs(src, tar)

Return absolute distance.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return the Warrens IV similarity of two strings.

__init__(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) None

Initialize WarrensIV instance.

Parameters:
alphabetCounter, collection, int, or None

This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.

tokenizer_Tokenizer

A tokenizer instance from the abydos.tokenizer package

intersection_typestr

Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.

**kwargs

Arbitrary keyword arguments

Other Parameters:
qvalint

The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.

metric_Distance

A string distance measure class for use in the soft and fuzzy variants.

thresholdfloat

A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.

Added in version 0.4.0.

sim(src: str, tar: str) float

Return the Warrens IV similarity of two strings.

Parameters:
srcstr

Source string (or QGrams/Counter objects) for comparison

tarstr

Target string (or QGrams/Counter objects) for comparison

Returns:
float

Warrens IV similarity

Examples

>>> cmp = WarrensIV()
>>> cmp.sim('cat', 'hat')
0.666095890410959
>>> cmp.sim('Niall', 'Neil')
0.5326918120113412
>>> cmp.sim('aluminum', 'Catalan')
0.21031040612607685
>>> cmp.sim('ATCG', 'TAGC')
0.0

Added in version 0.4.0.

_weighted_jaccard

abydos.distance._weighted_jaccard.

Weighted Jaccard similarity

class distances._weighted_jaccard.WeightedJaccard(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', weight: int = 3, **kwargs: Any)

Bases: _TokenDistance

Weighted Jaccard similarity.

For two sets X and Y and a weight w, the Weighted Jaccard similarity :cite:`Legendre:1998` is

\[sim_{Jaccard_w}(X, Y) = \frac{w \cdot |X \cap Y|} {w \cdot |X \cap Y| + |X \setminus Y| + |Y \setminus X|}\]

Here, the intersection between the two sets is weighted by w. Compare to Jaccard similarity (\(w = 1\)), and to Dice similarity (\(w = 2\)). In the default case, the weight of the intersection is 3, following :cite:`Legendre:1998`.

In 2x2 confusion table terms, where a+b+c+d=n, this is

\[sim_{Jaccard_w} = \frac{w\cdot a}{w\cdot a+b+c}\]

Added in version 0.4.0.
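The formula can be checked with a small multiset computation over padded bigrams. This sketch assumes crisp intersection and the default '$#' padding with q = 2; padded_bigrams and weighted_jaccard_sim are hypothetical names rather than library functions.

```python
from collections import Counter


def padded_bigrams(string: str) -> Counter:
    """Bigram multiset of the string padded with '$' and '#'."""
    padded = '$' + string + '#'
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def weighted_jaccard_sim(src: str, tar: str, weight: float = 3) -> float:
    """Compute w*a / (w*a + b + c) over bigram multisets."""
    x, y = padded_bigrams(src), padded_bigrams(tar)
    a = sum((x & y).values())  # |X intersect Y|
    b = sum((x - y).values())  # |X \ Y|
    c = sum((y - x).values())  # |Y \ X|
    # The padding guarantees at least one bigram, so the denominator is > 0.
    return weight * a / (weight * a + b + c)
```

For 'cat' and 'hat', a = 2 and b = c = 2, so the default weight of 3 gives 6 / 10. Setting weight=1 gives the plain Jaccard value of 2 / 6 for the same pair.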

Methods

dist(src, tar)

Return distance.

dist_abs(src, tar)

Return absolute distance.

set_params(**kwargs)

Store params in the params dict.

sim(src, tar)

Return the Weighted Jaccard similarity of two strings.

__init__(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', weight: int = 3, **kwargs: Any) None

Initialize WeightedJaccard instance.

Parameters:
tokenizer_Tokenizer

A tokenizer instance from the abydos.tokenizer package

intersection_typestr

Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.

weightint

The weight to apply to the intersection cardinality. (3, by default.)

**kwargs

Arbitrary keyword arguments

Other Parameters:
qvalint

The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.

metric_Distance

A string distance measure class for use in the soft and fuzzy variants.

thresholdfloat

A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.

Added in version 0.4.0.

sim(src: str, tar: str) float

Return the Weighted Jaccard similarity of two strings.

Parameters:
srcstr

Source string (or QGrams/Counter objects) for comparison

tarstr

Target string (or QGrams/Counter objects) for comparison

Returns:
float

Weighted Jaccard similarity

Examples

>>> cmp = WeightedJaccard()
>>> cmp.sim('cat', 'hat')
0.6
>>> cmp.sim('Niall', 'Neil')
0.46153846153846156
>>> cmp.sim('aluminum', 'Catalan')
0.16666666666666666
>>> cmp.sim('ATCG', 'TAGC')
0.0

Added in version 0.4.0.

_whitespace

abydos.tokenizer._whitespace.

Whitespace tokenizer

class distances._whitespace.WhitespaceTokenizer(scaler: str | Callable[[float], float] | None = None, flags: int = 0)

Bases: RegexpTokenizer

A whitespace tokenizer.

Methods

count()

Return token count.

count_unique()

Return the number of unique elements.

get_counter()

Return the tokens as a Counter object.

get_list()

Return the tokens as an ordered list.

get_set()

Return the unique tokens as a set.

tokenize(string)

Tokenize the term and store it.

Examples

>>> WhitespaceTokenizer().tokenize('a b c f a c g e a b')
WhitespaceTokenizer({'a': 3, 'b': 2, 'c': 2, 'f': 1, 'g': 1, 'e': 1})

Added in version 0.4.0.

__init__(scaler: str | Callable[[float], float] | None = None, flags: int = 0) None

Initialize tokenizer.

Parameters:
scalerNone, str, or function

A scaling function for the Counter:

  • None : no scaling

  • ‘set’ : All non-zero values are set to 1.

  • ‘length’ : Each token has weight equal to its length.

  • ‘length-log’ : Each token has weight equal to the log of its length + 1.

  • ‘length-exp’ : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

flagsint

Flags to pass to the regular expression matcher. See the documentation on Python’s re module for details.

Added in version 0.4.0.
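The scaler options listed above can be illustrated on a plain list of tokens. The sketch below is one possible reading of those descriptions, not the library's code: scale_tokens is a hypothetical helper, and the choice to accumulate length-based weights across repeated tokens is an assumption.

```python
from collections import Counter, defaultdict
from math import exp, log1p


def scale_tokens(tokens, scaler=None) -> dict:
    """Turn a token list into a weighted bag according to scaler."""
    length_weights = {'length': float, 'length-log': log1p, 'length-exp': exp}
    if isinstance(scaler, str) and scaler in length_weights:
        weight = length_weights[scaler]
        scaled = defaultdict(float)
        for token in tokens:
            # Assumption: each occurrence adds its length-based weight.
            scaled[token] += weight(len(token))
        return dict(scaled)
    counts = Counter(tokens)
    if scaler is None:
        return dict(counts)  # no scaling: raw counts
    if scaler == 'set':
        return {token: 1 for token in counts}  # all non-zero values become 1
    if callable(scaler):
        return {token: scaler(value) for token, value in counts.items()}
    raise ValueError('Unsupported scaler: {!r}'.format(scaler))
```

For example, scale_tokens(['a', 'a', 'a', 'a'], math.sqrt) maps the count 4 to 2.0, while scaler='set' reduces every count to 1.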