name_matching

run_nm

name_matching.run_nm.match_names(data_first: DataFrame | Series, data_second: DataFrame | Series, column_first='', column_second='', group_column_first='', group_column_second='', case_sensitive=False, punctuation_sensitive=False, special_character_sensitive=False, threshold=95, **kwargs) → DataFrame

Performs name matching between two datasets. First, a simple merge on the data finds the instances in which the names match exactly. Subsequently, the remaining names are matched using the name matching algorithm defined in name_matcher.

Parameters:
data_first: Union[pd.DataFrame, pd.Series]

The first dataframe or series used for the name matching

data_second: Union[pd.DataFrame, pd.Series]

The second dataframe or series used for the name matching. To match the data to itself, data_second should be equal to data_first

column_first: str

If data_first is a dataframe, column_first should be the column of data_first that contains the names to be matched. default=’’

column_second: str

If data_second is a dataframe, column_second should be the column of data_second that contains the names to be matched. default=’’

group_column_first: str

The name of the column that should be used to generate groups within the data_first dataframe. The matching is then only performed for instances in which the groups are identical. default=’’

group_column_second: str

The name of the column that should be used to generate groups within the data_second dataframe. The matching is then only performed for instances in which the groups are identical. default=’’

case_sensitive: bool

Boolean value indicating whether the matching is case sensitive. If False, all characters are converted to lowercase before the name matching starts. default=False

punctuation_sensitive: bool

Boolean value indicating whether punctuation is kept in the names. If False, punctuation is removed from the original names before the name matching starts. default=False

special_character_sensitive: bool

Boolean value indicating whether special characters are kept in the names. If False, the special characters are replaced before the name matching starts. default=False

threshold: int

The minimum score a match should have to be part of the output. default=95

**kwargs

Additional inputs for the name_matcher

Returns:
pd.DataFrame

A dataframe containing the matched rows where the match score is above the threshold. The dataframe consists of 4 columns: original_name, the original name from data_first after preprocessing; match_name_0, the name it is matched to from data_second after preprocessing; score_0, the score of the match; and match_index_0, the index of the match in data_second. The match_index_0 column can be used to join the data from both dataframes.
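The two-stage flow described above (exact matches first, fuzzy matching for the rest, then a threshold filter) can be sketched with the standard library alone. This is an illustration, not the package's implementation: `two_stage_match` is a hypothetical helper, `difflib.SequenceMatcher` stands in for the NameMatcher scoring, and treating the threshold as inclusive is an assumption.

```python
from difflib import SequenceMatcher


def two_stage_match(first, second, threshold=95):
    """Exact matches first, then fuzzy matches for the remaining names."""
    # Stage 1: index the second list so exact matches are a dictionary lookup.
    exact_index = {}
    for idx, name in enumerate(second):
        exact_index.setdefault(name, idx)

    results = []
    for name in first:
        if name in exact_index:
            results.append({"original_name": name, "match_name_0": name,
                            "score_0": 100.0, "match_index_0": exact_index[name]})
            continue
        # Stage 2: score every candidate on a 0-100 scale and keep the best.
        # SequenceMatcher is only a stand-in for the real fuzzy scoring.
        scored = [(100 * SequenceMatcher(None, name, cand).ratio(), idx)
                  for idx, cand in enumerate(second)]
        if not scored:
            continue
        score, idx = max(scored)
        # Assumption: a score equal to the threshold is kept.
        if score >= threshold:
            results.append({"original_name": name, "match_name_0": second[idx],
                            "score_0": score, "match_index_0": idx})
    return results
```

Each result row mirrors the four output columns listed above, so `match_index_0` can be used to look up the matched record in the second dataset.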

name_matcher

class name_matching.name_matcher.NameMatcher(ngrams: tuple = (2, 3), top_n: int = 50, low_memory: bool = False, number_of_rows: int = 5000, number_of_matches: int = 1, lowercase: bool = True, non_word_characters: bool | None = None, remove_ascii: bool = True, punctuations: bool | None = None, legal_suffixes: bool = False, preprocess_legal: bool = False, delete_legal: bool = False, make_abbreviations: bool = True, common_words: bool | list = False, cut_off_no_scoring_words: float = 0.01, preprocess_split: bool = False, begin_end_legal_pre_suffix: bool = True, verbose: bool = True, distance_metrics: list | tuple = ['overlap', 'weighted_jaccard', 'ratcliff_obershelp', 'fuzzy_wuzzy_token_sort', 'editex'], row_numbers: bool = False, return_algorithms_score: bool = False, save_intermediate_results: bool = False, load_intermediate_results: bool = False, intermediate_results_name: dict[str, str] = {'matching_data': 'df_matching_data_name', 'possible_matches': 'possible_matches_name', 'to_be_matched': 'to_be_matched_name'})

Bases: object

Name matching using character n-gram cosine similarity followed by fuzzy matching.

This class first vectorizes names using character n-grams (via TF-IDF and cosine similarity), selects the top N candidates, and then applies multiple fuzzy matching distance metrics to pick the best match or top-K matches.
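As a rough illustration of the candidate-selection step, the sketch below scores names by the cosine similarity of their character n-gram counts and keeps the top N. It is a simplification under stated assumptions: it uses raw n-gram counts rather than TF-IDF weights, and `char_ngrams`, `cosine` and `top_candidates` are hypothetical helper names, not part of the package.

```python
import math
from collections import Counter


def char_ngrams(name, sizes=(2, 3)):
    """Count all character n-grams of the given lengths in a name."""
    grams = Counter()
    for n in sizes:
        for i in range(len(name) - n + 1):
            grams[name[i:i + n]] += 1
    return grams


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_candidates(query, master, top_n=50):
    """Return the indexes of the top_n master names most similar to the query."""
    query_vec = char_ngrams(query)
    scores = [(cosine(query_vec, char_ngrams(name)), idx)
              for idx, name in enumerate(master)]
    # Highest similarity first; ties are broken by the lower index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scores[:top_n]]
```

Only the candidates returned by this cheap step would then be passed to the more expensive fuzzy metrics.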

Parameters:
ngrams: tuple of int, default=(2, 3)

Character n-gram lengths used for cosine similarity.

top_n: int, default=50

Number of top candidates to return from the cosine step.

low_memory: bool, default=False

If True, uses a low-memory approach for sparse cosine similarity.

number_of_rows: int, default=5000

Number of rows processed at once in the sparse cosine similarity step. Ignored if low_memory is True.

number_of_matches: int, default=1

Number of fuzzy-matched alternatives to return.

lowercase: bool, default=True

If True, converts text to lowercase during preprocessing.

non_word_characters: Optional[bool], default=True

If True, strips non-word characters (excluding & and #) during preprocessing.

remove_ascii: bool, default=True

If True, transliterates to ASCII (dropping accents) during preprocessing.

punctuations: Optional[bool], default=None

Deprecated alias for non_word_characters.

legal_suffixes: bool, default=False

If True, post-processing will ignore common company legal suffixes in scoring.

preprocess_legal: bool, default=False

If True, strips or abbreviates legal suffixes/prefixes during preprocessing.

delete_legal: bool, default=False

If True, deletes legal suffixes/prefixes instead of abbreviating them.

make_abbreviations: bool, default=True

If True, replaces common words with their abbreviations during preprocessing.

common_words: Union[bool, list], default=False

If True, will post-process to down-weight the most common words. If a list is provided, those specific words will be down-weighted.

cut_off_no_scoring_words: float, default=0.01

Threshold (fraction of max frequency) above which a word is considered too common.

preprocess_split: bool, default=False

If True, performs an additional “split” variant of preprocessing for searching.

begin_end_legal_pre_suffix: bool, default=True

If True, only abbreviate legal terms at the beginning or end of names.

verbose: bool, default=True

If True, prints progress via tqdm.

distance_metrics: list of str, default=["overlap", "weighted_jaccard", "ratcliff_obershelp", "fuzzy_wuzzy_token_sort", "editex"]

List of distance metric names to use in the fuzzy-matching step.

row_numbers: bool, default=False

If True, returns original DataFrame index values in the match results.

return_algorithms_score: bool, default=False

If True, return the full per-algorithm score matrix instead of just combined scores.

save_intermediate_results: bool, default=False

If True, saves intermediate pickle files for matching_data, to_be_matched, possible_matches.

load_intermediate_results: bool, default=False

If True, attempts to load intermediate pickle files before recomputing.

intermediate_results_name: dict of str to str, default={"matching_data": "df_matching_data_name", "to_be_matched": "to_be_matched_name", "possible_matches": "possible_matches_name"}

Filenames (without “.pkl”) for saving/loading intermediate results.

Methods

fuzzy_matches(possible_matches, to_be_matched)

Performs the fuzzy matching between the names in the to_be_matched series and the matching_data entries whose indexes are given as possible matching candidates.

load_and_process_master_data(column, ...[, ...])

Load the matching data into the NameMatcher and start the preprocessing.

match_names(to_be_matched, column_matching)

Match input names against the preprocessed master data.

postprocess(match)

Postprocesses the scores to exclude certain specific company words or the most common words.

preprocess(df, column_name[, original_name])

Preprocess a dataframe before applying a name matching algorithm.

set_distance_metrics(metrics)

A method to set which of the distance metrics should be employed during the fuzzy matching.

transform_data()

A method which transforms the matching data based on the ngrams transformer.

unicode_to_ascii(text)

Converts a string to ASCII characters through transliteration.

Raises:
TypeError

If common_words is not a bool or iterable of strings.

__init__(ngrams: tuple = (2, 3), top_n: int = 50, low_memory: bool = False, number_of_rows: int = 5000, number_of_matches: int = 1, lowercase: bool = True, non_word_characters: bool | None = None, remove_ascii: bool = True, punctuations: bool | None = None, legal_suffixes: bool = False, preprocess_legal: bool = False, delete_legal: bool = False, make_abbreviations: bool = True, common_words: bool | list = False, cut_off_no_scoring_words: float = 0.01, preprocess_split: bool = False, begin_end_legal_pre_suffix: bool = True, verbose: bool = True, distance_metrics: list | tuple = ['overlap', 'weighted_jaccard', 'ratcliff_obershelp', 'fuzzy_wuzzy_token_sort', 'editex'], row_numbers: bool = False, return_algorithms_score: bool = False, save_intermediate_results: bool = False, load_intermediate_results: bool = False, intermediate_results_name: dict[str, str] = {'matching_data': 'df_matching_data_name', 'possible_matches': 'possible_matches_name', 'to_be_matched': 'to_be_matched_name'})
fuzzy_matches(possible_matches: array, to_be_matched: Series) → Series

Performs the fuzzy matching between the names in the to_be_matched series and the matching_data entries whose indexes are given as possible matching candidates.

Parameters:
possible_matches: np.array

An array containing the indexes of the matching data with potential matches

to_be_matched: pd.Series

The data which should be matched

Returns:
pd.Series

A series containing the match index from the matching_data dataframe, the name in the to_be_matched data, the name to which the datapoint was matched, and a score between 0 (no match) and 100 (perfect match) indicating the quality of the match.
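To make the 0 to 100 scale concrete, here is a minimal sketch of one of the default metrics, ratcliff_obershelp, written from its textbook definition: find the longest common substring, recurse on the pieces to its left and right, and return twice the number of matched characters divided by the combined length. The function names are hypothetical, and how the package combines several metrics into one score is not shown here.

```python
def matching_characters(a, b):
    """Count matched characters: longest common substring, then recurse on both sides."""
    if not a or not b:
        return 0
    best_len, best_a, best_b = 0, 0, 0
    # lengths[j + 1] holds the length of the common run ending at a[i - 1], b[j].
    lengths = [0] * (len(b) + 1)
    for i, char_a in enumerate(a):
        new_lengths = [0] * (len(b) + 1)
        for j, char_b in enumerate(b):
            if char_a == char_b:
                new_lengths[j + 1] = lengths[j] + 1
                if new_lengths[j + 1] > best_len:
                    best_len = new_lengths[j + 1]
                    best_a = i - best_len + 1
                    best_b = j - best_len + 1
        lengths = new_lengths
    if best_len == 0:
        return 0
    left = matching_characters(a[:best_a], b[:best_b])
    right = matching_characters(a[best_a + best_len:], b[best_b + best_len:])
    return left + best_len + right


def ratcliff_obershelp_score(a, b):
    """Similarity on a 0-100 scale: 100 * 2 * matches / total length."""
    if not a and not b:
        return 100.0
    return 100.0 * 2 * matching_characters(a, b) / (len(a) + len(b))
```

For "WIKIMEDIA" and "WIKIMANIA" the matched pieces are "WIKIM" and "IA", so 7 of the 18 characters on each side pair up and the score is 100 * 14 / 18.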

load_and_process_master_data(column: str, df_matching_data: DataFrame, start_processing: bool = True, transform: bool = True) → None

Load the matching data into the NameMatcher and start the preprocessing.

Parameters:
column: str

The column name of the dataframe which should be used for the matching

df_matching_data: pd.DataFrame

The dataframe which is used to match the data to.

start_processing: bool

A boolean indicating whether to start the preprocessing step after loading the matching data. If transform is True the data will still be transformed and the preprocessing will be marked as completed. default: True

transform: bool

A boolean indicating whether or not the data should be transformed after the vectoriser is initialised default: True

match_names(to_be_matched: Series | DataFrame, column_matching: str) → Series | DataFrame | Tuple[DataFrame, DataFrame]

Match input names against the preprocessed master data.

This will:
  1. Preprocess the new names.

  2. Compute cosine-similarity top-N candidates.

  3. Apply fuzzy matching to those candidates.

Parameters:
to_be_matched: pandas Series or DataFrame

New names to match.

column_matching: str

Column name in to_be_matched containing the names.

Returns:
pandas Series or DataFrame

If return_algorithms_score=False and number_of_matches=1, returns a DataFrame containing:

  • original_name

  • match_name

  • score

  • match_index

If return_algorithms_score=False and number_of_matches>1, returns a DataFrame with columns for each alternative.

If return_algorithms_score=True, returns a tuple ( DataFrame_of_scores, DataFrame_of_matched_names ).

postprocess(match: Series) → Series

Postprocesses the scores to exclude certain specific company words or the most common words. In this method only the scores are adjusted; the matches themselves are unchanged.

Parameters:
match: pd.Series

The series with the possible matches and original scores

Returns:
pd.Series

A new version of the input series with updated scores
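The idea of a "too common" word (see cut_off_no_scoring_words above) can be sketched as follows: count how often each word occurs across all names and flag the words whose count exceeds a fraction of the highest count. This is an illustration only; `too_common_words` and `strip_common` are hypothetical helpers, and the way the package actually adjusts the scores is not described in this reference.

```python
from collections import Counter


def too_common_words(names, cut_off=0.01):
    """Return words whose frequency exceeds cut_off times the highest word frequency."""
    counts = Counter(word for name in names for word in name.split())
    if not counts:
        return set()
    limit = cut_off * max(counts.values())
    return {word for word, count in counts.items() if count > limit}


def strip_common(name, common):
    """Drop the flagged words so they no longer influence a comparison."""
    return " ".join(word for word in name.split() if word not in common)
```

With a small sample a larger cut-off is needed to see an effect: for the names "acme bank", "first bank", "city bank" and "globex" and a cut-off of 0.5, only "bank" is flagged.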

preprocess(df: DataFrame, column_name: str, original_name: bool = False) → DataFrame

Preprocess a dataframe before applying a name matching algorithm. The preprocessing consists of removing special characters and spaces, converting all characters to lowercase, and removing the words given in the word lists.

Parameters:
df: DataFrame

The dataframe or series on which the preprocessing needs to be performed

column_name: str

The name of the column that is used for the preprocessing

original_name: bool

If True, returns an additional column ‘original_name’ in the dataframe; this column holds the original, non-processed name. default=False

Returns:
pd.DataFrame

The preprocessed dataframe or series depending on the input
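A minimal sketch of this kind of normalisation on a single string, assuming the behaviour described for the lowercase and non_word_characters options (non-word characters are stripped except & and #). `preprocess_name` is a hypothetical helper; collapsing repeated whitespace is an assumption, and word-list removal and legal-suffix handling are left out.

```python
import re


def preprocess_name(name, lowercase=True, non_word_characters=True):
    """Normalise one name: lowercase, strip punctuation, tidy whitespace."""
    if lowercase:
        name = name.lower()
    if non_word_characters:
        # Drop everything that is not a word character, whitespace, & or #.
        name = re.sub(r"[^\w\s&#]", "", name)
    # Collapse repeated whitespace and trim both ends.
    return re.sub(r"\s+", " ", name).strip()
```

For example, `preprocess_name("  ACME,  Holdings & Co. #1 ")` gives `"acme holdings & co #1"`.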

set_distance_metrics(metrics: List) → None

A method to set which of the distance metrics should be employed during the fuzzy matching. For very short explanations of most of the name matching algorithms please see the make_distance_metrics function in distance_metrics.py

Parameters:
metrics: List

The list with the distance metrics to be used during the name matching. The distance metrics can be chosen from the list below:

indel, discounted_levenshtein, tichy, cormodel_z, iterative_sub_string, baulieu_xiii, clement, dice_asymmetrici, kuhns_iii, overlap, pearson_ii, weighted_jaccard, warrens_iv, bag, rouge_l, q_grams, ratcliff_obershelp, ncd_bz2, fuzzy_wuzzy_partial_string, fuzzy_wuzzy_token_sort, fuzzy_wuzzy_token_set, editex, typo, lig_3, ssk, refined_soundex, double_metaphone

transform_data() → None

A method which transforms the matching data based on the ngrams transformer. After the transformation (the generation of the ngrams), the data is normalised by dividing each row by the sum of the row. Subsequently the data is changed to a coo sparse matrix format with the column indices in ascending order.
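The normalisation and ordering steps can be shown on a toy sparse representation in which every row is a dict of column index to n-gram count. This is a sketch of the idea only; `normalise_rows` and `to_coo` are hypothetical helpers, and the package itself works on SciPy sparse matrices.

```python
def normalise_rows(rows):
    """Divide every entry in a row by the sum of that row."""
    normalised = []
    for row in rows:
        total = sum(row.values())
        normalised.append({col: val / total for col, val in row.items()} if total else {})
    return normalised


def to_coo(rows):
    """Return (row, col, value) triplets with column indices ascending within each row."""
    return [(r, col, row[col]) for r, row in enumerate(rows) for col in sorted(row)]
```

After normalisation every non-empty row sums to 1, and the triplets are in the coordinate (COO) layout with sorted column indices.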

unicode_to_ascii(text: str) → str

Converts a string to ASCII characters through transliteration. The transliteration map is stored in the transliterations.py file in the data folder.

Parameters:
text: str

The text to be transliterated to ascii characters

Returns:
str

The processed text without any non-ASCII characters
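For comparison, a common standard-library approximation of this step decomposes accented characters with Unicode normalisation and then drops whatever is not ASCII. This is a swapped-in technique, not the package's transliteration map, and `to_ascii` is a hypothetical helper.

```python
import unicodedata


def to_ascii(text):
    """Strip accents by decomposing characters, then discard non-ASCII code points."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Characters that have no decomposition, such as ß or ø, are simply dropped by this approach, which is one reason an explicit transliteration map can give better results.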

distance_metrics

name_matching.distance_metrics.make_distance_metrics(indel: bool | dict = False, discounted_levenshtein: bool | dict = False, levenshtein: bool | dict = False, jaro_winkler: bool | dict = False, tichy: bool | dict = False, cormodel_z: bool | dict = False, iterative_sub_string: bool | dict = False, baulieu_xiii: bool | dict = False, clement: bool | dict = False, dice_asymmetrici: bool | dict = False, kuhns_iii: bool | dict = False, overlap: bool | dict = False, pearson_ii: bool | dict = False, weighted_jaccard: bool | dict = False, warrens_iv: bool | dict = False, bag: bool | dict = False, rouge_l: bool | dict = False, token_distance: bool | dict = False, ratcliff_obershelp: bool | dict = False, ncd_bz2: bool | dict = False, fuzzy_wuzzy_partial_string: bool | dict = False, fuzzy_wuzzy_token_sort: bool | dict = False, fuzzy_wuzzy_token_set: bool | dict = False, editex: bool | dict = False, typo: bool | dict = False, lig_3: bool | dict = False, ssk: bool | dict = False, refined_soundex: bool | dict = False, double_metaphone: bool | dict = False) → dict

A function which returns a dict containing the distance metrics that should be used during the fuzzy string matching. The available metrics are grouped as follows:

Levenshtein edit distance
  • Indel

  • Discounted Levenshtein

  • Levenshtein

  • LIG3

  • Jaro-Winkler

Block edit distances
  • Tichy

  • CormodeLZ

Multi-set token-based distance
  • BaulieuXIII

  • Clement

  • DiceAsymmetricI

  • KuhnsIII

  • Overlap

  • PearsonII

  • WeightedJaccard

  • WarrensIV

  • Bag

  • RougeL

  • Token distance

Subsequence distances
  • IterativeSubString

  • RatcliffObershelp

  • SSK

Normalized compression distance
  • NCDbz2

FuzzyWuzzy distances
  • FuzzyWuzzyPartialString

  • FuzzyWuzzyTokenSort

  • FuzzyWuzzyTokenSet

Phonetic distances
  • RefinedSoundex

  • DoubleMetaphone

Edit distances
  • Editex

  • Typo

Parameters:
indel: bool

Boolean indicating whether the Indel method should be used during the fuzzy name matching. The Indel method is equal to a regular Levenshtein distance with a substitution weight that is twice as high. If a dictionary is provided, it is used as parameters for the indel distance metric. default=False

discounted_levenshtein: bool

Boolean indicating whether the DiscountedLevenshtein method should be used during the fuzzy name matching. Equal to the regular Levenshtein distance, except that errors later in the string are counted at a discounted rate, which limits the importance of, for instance, suffix differences. If a dictionary is provided, it is used as parameters for the discounted_levenshtein distance metric. default=False

levenshtein: bool

Boolean indicating whether the Levenshtein method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the levenshtein distance metric. default=False

jaro_winkler: bool

Boolean indicating whether the JaroWinkler method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the jaro winkler distance metric. default=False

tichy: bool

Boolean indicating whether the Tichy method should be used during the fuzzy name matching. This algorithm provides a shortest edit distance based on substring and add operations. If a dictionary is provided, it is used as parameters for the tichy distance metric. default=False

cormodel_z: bool

Boolean indicating whether the CormodeLZ method should be used during the fuzzy name matching. The CormodeLZ distance between strings x and y, is the minimum number of single characters or substrings of y or of the partially built string which are required to produce x from left to right. If a dictionary is provided, it is used as parameters for the cormodel_z distance metric. default=False

iterative_sub_string: bool

Boolean indicating whether the IterativeSubString method should be used during the fuzzy name matching. A method that counts the similarities between the substrings of two strings and subtracts the differences, taking into account the Winkler similarity between the string and the substring. If a dictionary is provided, it is used as parameters for the iterative_sub_string distance metric. default=False

baulieu_xiii: bool

Boolean indicating whether the BaulieuXIII method should be used during the fuzzy name matching. The Baulieu XIII distance between two strings is given by the following formula: (|X \ Y| + |Y \ X|) / ( |X ∩ Y| + |X \ Y| + |Y \ X| + |X ∩ Y| ∙ (|X ∩ Y| - 4)^2). If a dictionary is provided, it is used as parameters for the baulieu_xiii distance metric. default=False

clement: bool

Boolean indicating whether the Clement method should be used during the fuzzy name matching. The Clement distance between two strings is given by the following formula: (|X ∩ Y|/|X|)*(1-|X|/|N|) + (|(N \ X) \ Y|/|N \ X|) * (1-|N \ X|/|N|). If a dictionary is provided, it is used as parameters for the clement distance metric. default=False

dice_asymmetrici: bool

Boolean indicating whether the DiceAsymmetricI method should be used during the fuzzy name matching. The Dice asymmetric similarity is given by |X ∩ Y|/|X|. If a dictionary is provided, it is used as parameters for the dice_asymmetrici distance metric. default=False

kuhns_iii: bool

Boolean indicating whether the KuhnsIII method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the kuhns_iii distance metric. default=False

overlap: bool

Boolean indicating whether the Overlap method should be used during the fuzzy name matching. The overlap distance is given by: |X ∩ Y|/min(|X|,|Y|). If a dictionary is provided, it is used as parameters for the overlap distance metric. default=False

pearson_ii: bool

Boolean indicating whether the PearsonII method should be used during the fuzzy name matching. This algorithm is based on the Phi coefficient or the mean square contingency. If a dictionary is provided, it is used as parameters for the pearson_ii distance metric. default=False

weighted_jaccard: bool

Boolean indicating whether the WeightedJaccard method should be used during the fuzzy name matching. This is the Jaccard distance with a weighting of 3 for the differences. If a dictionary is provided, it is used as parameters for the weighted_jaccard distance metric. default=False

warrens_iv: bool

Boolean indicating whether the WarrensIV method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the warrens_iv distance metric. default=False

bag: bool

Boolean indicating whether the Bag method should be used during the fuzzy name matching. This is a simplification of the regular edit distance that uses a similarity tree structure. If a dictionary is provided, it is used as parameters for the bag distance metric. default=False

rouge_l: bool

Boolean indicating whether the ROUGE-L method should be used during the fuzzy name matching. The ROUGE-L method is a measure based on the longest common subsequence between two strings. If a dictionary is provided, it is used as parameters for the rouge_l distance metric. default=False

token_distance: bool

Boolean indicating whether a token distance method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the tokenised distance metrics. default=False

ratcliff_obershelp: bool

Boolean indicating whether the RatcliffObershelp method should be used during the fuzzy name matching. This method finds the longest common substring and then evaluates the longest common substrings to the right and the left of that original longest common substring. If a dictionary is provided, it is used as parameters for the ratcliff_obershelp distance metric. default=False

ncd_bz2: bool

Boolean indicating whether the NCDbz2 method should be used during the fuzzy name matching. Applies the Burrows-Wheeler transform to the strings and subsequently returns the normalised compression distance. If a dictionary is provided, it is used as parameters for the ncd_bz2 distance metric. default=False

fuzzy_wuzzy_partial_string: bool

Boolean indicating whether the FuzzyWuzzyPartialString method should be used during the fuzzy name matching. This method takes the length of the longest common substring and divides it by the length of the shorter of the two strings. If a dictionary is provided, it is used as parameters for the fuzzy_wuzzy_partial_string distance metric. default=False

fuzzy_wuzzy_token_sort: bool

Boolean indicating whether the FuzzyWuzzyTokenSort method should be used during the fuzzy name matching. This tokenizes the words in the string and sorts them; subsequently a Hamming distance is calculated. If a dictionary is provided, it is used as parameters for the fuzzy_wuzzy_token_sort distance metric. default=False

fuzzy_wuzzy_token_set: bool

Boolean indicating whether the FuzzyWuzzyTokenSet method should be used during the fuzzy name matching. This method tokenizes the strings, finds the largest intersection of the two substrings, and divides it by the length of the shortest string. If a dictionary is provided, it is used as parameters for the fuzzy_wuzzy_token_set distance metric. default=False

editex: bool

Boolean indicating whether the Editex method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the editex distance metric. default=False

typo: bool

Boolean indicating whether the Typo method should be used during the fuzzy name matching. The typo distance is calculated based on the distance on a keyboard between edits. If a dictionary is provided, it is used as parameters for the typo distance metric. default=False

lig_3: bool

Boolean indicating whether the LIG3 method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the lig_3 distance metric. default=False

ssk: bool

Boolean indicating whether the SSK method should be used during the fuzzy name matching. The ssk algorithm looks at the string kernel generated by all the possible different subsequences present between the two strings. If a dictionary is provided, it is used as parameters for the ssk distance metric. default=False

refined_soundex: bool

Boolean indicating whether the string should be represented by the RefinedSoundex phonetic translation and the Levenshtein distance of the translated strings should be included in the fuzzy matching process. If a dictionary is provided, it is used as parameters for the refined_soundex distance metric. default=False

double_metaphone: bool

Boolean indicating whether the string should be represented by the DoubleMetaphone phonetic translation and the Levenshtein distance of the translated strings should be included in the fuzzy matching process. If a dictionary is provided, it is used as parameters for the double_metaphone distance metric. default=False
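A few of the definitions in the parameter list are simple enough to write out directly. The sketch below covers the Indel distance (Levenshtein with a doubled substitution cost) and the Overlap and Dice asymmetric formulas. It is an illustration under stated assumptions, not the package's implementation: the function names are hypothetical, and tokenising names into character bigram multisets is an assumed choice.

```python
from collections import Counter


def edit_distance(a, b, substitution_cost=1):
    """Levenshtein-style distance with a configurable substitution cost."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else substitution_cost
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def bigrams(name):
    """Multiset of character bigrams (an assumed tokenisation)."""
    return Counter(name[i:i + 2] for i in range(len(name) - 1))


def overlap(x, y):
    """|X ∩ Y| / min(|X|, |Y|) on bigram multisets."""
    a, b = bigrams(x), bigrams(y)
    smallest = min(sum(a.values()), sum(b.values()))
    return sum((a & b).values()) / smallest if smallest else 0.0


def dice_asymmetric(x, y):
    """|X ∩ Y| / |X| on bigram multisets."""
    a, b = bigrams(x), bigrams(y)
    size = sum(a.values())
    return sum((a & b).values()) / size if size else 0.0
```

Calling `edit_distance(a, b, substitution_cost=2)` gives the Indel distance, because a substitution then costs the same as one deletion plus one insertion. The Overlap value is symmetric in its arguments, while the Dice asymmetric value depends on which name is passed first.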

sparse_cosine

name_matching.sparse_cosine.sparse_cosine_top_n(matrix_a: csc_matrix | coo_matrix, matrix_b: csc_matrix, top_n: int, low_memory: bool, number_of_rows: int, verbose: bool)

Calculates the top_n cosine matches between matrix_a and matrix_b. Takes into account the amount of memory that should be used based on the low_memory flag.

Parameters:

matrix_a: csc_matrix

The largest sparse csc matrix which should be multiplied

matrix_b: csc_matrix

The smallest sparse csc matrix which should be multiplied

top_n: int

The best n matches that should be returned

low_memory: bool

A bool indicating whether the low memory sparse cosine approach should be used

number_of_rows: int

An int indicating the number of rows which should be processed at once when calculating the cosine similarity

verbose: bool

A boolean indicating whether the progress should be printed

Returns:
np.array

The indexes for the n best sparse cosine matches between matrix a and b
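The effect of number_of_rows can be illustrated with a small pure-Python sketch that processes one matrix in batches and keeps a running top-n per query row. Several things here are assumptions: rows are stored as dicts of column index to value, they are taken to be L2-normalised already so that a dot product equals the cosine similarity, the result is one list of matrix_a row indexes per matrix_b row, and `dot` and `top_n_matches` are hypothetical helpers rather than the package's routine.

```python
import heapq


def dot(row_a, row_b):
    """Dot product of two sparse rows stored as {column: value} dicts."""
    if len(row_a) > len(row_b):
        row_a, row_b = row_b, row_a
    return sum(value * row_b.get(col, 0.0) for col, value in row_a.items())


def top_n_matches(rows_a, rows_b, top_n, number_of_rows=5000):
    """For every row in rows_b, return the indexes of its top_n most similar rows in rows_a."""
    best = [[] for _ in rows_b]
    # Only number_of_rows rows of rows_a are scored at a time.
    for start in range(0, len(rows_a), number_of_rows):
        batch = rows_a[start:start + number_of_rows]
        for b_idx, row_b in enumerate(rows_b):
            scored = [(dot(row_a, row_b), start + offset)
                      for offset, row_a in enumerate(batch)]
            # Merge the batch scores into the running top_n for this query row.
            best[b_idx] = heapq.nlargest(top_n, best[b_idx] + scored)
    return [[idx for _, idx in matches] for matches in best]
```

The batch size changes how many scores are held in memory at once but not the result, since the running top-n is merged after every batch.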