`name_matching`¶

`run_nm`¶

name_matching.run_nm.match_names(data_first: DataFrame | Series, data_second: DataFrame | Series, column_first='', column_second='', group_column_first='', group_column_second='', case_sensitive=False, punctuation_sensitive=False, special_character_sensitive=False, threshold=95, **kwargs) → DataFrame¶

Function which performs name matching. First a simple merge on the data is performed to get the instances in which the name matches perfectly. Subsequently the names without a perfect match are matched using the name matching algorithm as defined in name_matcher.

Parameters:

data_first: Union[pd.DataFrame, pd.Series]: The first dataframe or series used for the name matching.
data_second: Union[pd.DataFrame, pd.Series]: The second dataframe or series used for the name matching. To match the data to itself, data_second should be equal to data_first.
column_first: str: If data_first is a dataframe, the column of data_first that contains the names to be matched. default=''
column_second: str: If data_second is a dataframe, the column of data_second that contains the names to be matched. default=''
group_column_first: str: The name of the column used to generate groups within the data_first dataframe. The matching is then only performed for instances in which the groups are identical. default=''
group_column_second: str: The name of the column used to generate groups within the data_second dataframe. The matching is then only performed for instances in which the groups are identical. default=''
case_sensitive: bool: Whether the matching is case sensitive. If False, all characters are converted to lowercase before the name matching starts. default=False
punctuation_sensitive: bool: Whether punctuation is kept in the names. If False, punctuation is removed from the original names before the name matching starts. default=False
special_character_sensitive: bool: Whether special characters are kept in the names. If False, the special characters are replaced before the name matching starts. default=False
threshold: int: The minimal score a match should have to be part of the output. default=95
**kwargs: Additional inputs for the name_matcher.

Returns:

pd.DataFrame: A dataframe containing the matched rows where the match score is above the threshold. The dataframe consists of 4 columns: original_name, the original name from data_first after preprocessing; match_name_0, the name it is matched to from data_second after preprocessing; score_0, the score of the match; and match_index_0, the index of the match in data_second. The match_index_0 column can be used to join the data from both dataframes.
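
The two-stage idea described for this function (perfect matches first, fuzzy matching with a score threshold afterwards) can be illustrated with the standard library alone. This is a minimal sketch, not the package's implementation: `normalise` and `two_stage_match` are hypothetical helpers, and `difflib` stands in for the scoring done by name_matcher.

```python
from difflib import SequenceMatcher
import string


def normalise(name, case_sensitive=False, punctuation_sensitive=False):
    """Apply two of the preprocessing switches described above to one name."""
    if not case_sensitive:
        name = name.lower()
    if not punctuation_sensitive:
        name = name.translate(str.maketrans("", "", string.punctuation))
    return " ".join(name.split())  # collapse repeated whitespace


def two_stage_match(first, second, threshold=95):
    """Exact matches first, then fuzzy matches for the leftover names."""
    clean_second = [normalise(n) for n in second]
    # Map each cleaned name to the index of its first occurrence.
    lookup = {n: i for i, n in reversed(list(enumerate(clean_second)))}
    results = []
    for name in first:
        clean = normalise(name)
        if clean in lookup:  # stage 1: perfect match
            results.append((clean, clean, 100.0, lookup[clean]))
            continue
        # Stage 2: score every candidate and keep the best one (assumed scorer).
        scores = [100 * SequenceMatcher(None, clean, c).ratio() for c in clean_second]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            results.append((clean, clean_second[best], scores[best], best))
    return results


# Each tuple mirrors (original_name, match_name_0, score_0, match_index_0).
rows = two_stage_match(
    ["ACME Corp.", "Globex Corporatoin", "Initech"],
    ["Acme Corp", "Globex Corporation", "Umbrella"],
    threshold=90,
)
```

Names that stay below the threshold, such as "Initech" here, are simply absent from the result, which matches the description of the returned dataframe.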

`name_matcher`¶

class name_matching.name_matcher.NameMatcher(ngrams: tuple = (2, 3), top_n: int = 50, low_memory: bool = False, number_of_rows: int = 5000, number_of_matches: int = 1, lowercase: bool = True, non_word_characters: bool | None = None, remove_ascii: bool = True, punctuations: bool | None = None, legal_suffixes: bool = False, preprocess_legal: bool = False, delete_legal: bool = False, make_abbreviations: bool = True, common_words: bool | list = False, cut_off_no_scoring_words: float = 0.01, preprocess_split: bool = False, begin_end_legal_pre_suffix: bool = True, verbose: bool = True, distance_metrics: list | tuple = ['overlap', 'weighted_jaccard', 'ratcliff_obershelp', 'fuzzy_wuzzy_token_sort', 'editex'], row_numbers: bool = False, return_algorithms_score: bool = False, save_intermediate_results: bool = False, load_intermediate_results: bool = False, intermediate_results_name: dict[str, str] = {'matching_data': 'df_matching_data_name', 'possible_matches': 'possible_matches_name', 'to_be_matched': 'to_be_matched_name'})¶

Bases: object

Name matching using character n-gram cosine similarity followed by fuzzy matching.

This class first vectorizes names using character n-grams (via TF-IDF and cosine similarity), selects the top N candidates, and then applies multiple fuzzy matching distance metrics to pick the best match or top-K matches.
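
To make the candidate-selection step concrete, the sketch below computes character n-gram vectors and ranks master names by cosine similarity. It is a simplified illustration under stated assumptions: it uses raw n-gram counts rather than TF-IDF weights, and `char_ngrams`, `cosine` and `top_candidates` are hypothetical names, not part of the package.

```python
from collections import Counter
from math import sqrt


def char_ngrams(name, ngrams=(2, 3)):
    """Count all character n-grams of the requested lengths."""
    counts = Counter()
    for n in ngrams:
        counts.update(name[i:i + n] for i in range(len(name) - n + 1))
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_candidates(query, master, top_n=2):
    """Return the indexes of the top_n most similar master names."""
    q = char_ngrams(query)
    scored = sorted(
        ((cosine(q, char_ngrams(m)), i) for i, m in enumerate(master)),
        key=lambda pair: (-pair[0], pair[1]),  # best score first, lowest index on ties
    )
    return [i for _, i in scored[:top_n]]


master = ["acme corporation", "acme holdings", "globex corporation", "initech"]
candidates = top_candidates("acme corp", master, top_n=2)  # -> [0, 1]
```

Only the returned candidates would then be passed to the slower fuzzy metrics, which is the reason for the two-step design.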

Parameters:

ngrams: tuple of int, default=(2, 3): Character n-gram lengths used for cosine similarity.
top_n: int, default=50: Number of top candidates to return from the cosine step.
low_memory: bool, default=False: If True, uses a low-memory approach for sparse cosine similarity.
number_of_rows: int, default=5000: Batch size for low-memory sparse cosine similarity. Ignored if low_memory is True.
number_of_matches: int, default=1: Number of fuzzy-matched alternatives to return.
lowercase: bool, default=True: If True, converts text to lowercase during preprocessing.
non_word_characters: Optional[bool], default=True: If True, strips non-word characters (excluding & and #) during preprocessing.
remove_ascii: bool, default=True: If True, transliterates to ASCII (dropping accents) during preprocessing.
punctuations: Optional[bool], default=None: Deprecated alias for non_word_characters.
legal_suffixes: bool, default=False: If True, post-processing will ignore common company legal suffixes in scoring.
preprocess_legal: bool, default=False: If True, strips or abbreviates legal suffixes/prefixes during preprocessing.
delete_legal: bool, default=False: If True, deletes legal suffixes/prefixes instead of abbreviating them.
make_abbreviations: bool, default=True: If True, replaces common words with their abbreviations during preprocessing.
common_words: Union[bool, list], default=False: If True, will post-process to down-weight the most common words. If a list is provided, those specific words will be down-weighted.
cut_off_no_scoring_words: float, default=0.01: Threshold (fraction of max frequency) above which a word is considered too common.
preprocess_split: bool, default=False: If True, performs an additional “split” variant of preprocessing for searching.
begin_end_legal_pre_suffix: bool, default=True: If True, only abbreviate legal terms at the beginning or end of names.
verbose: bool, default=True: If True, prints progress via tqdm.
distance_metrics: list of str, default=["overlap", "weighted_jaccard", "ratcliff_obershelp", "fuzzy_wuzzy_token_sort", "editex"]: List of distance metric names to use in the fuzzy-matching step.
row_numbers: bool, default=False: If True, returns original DataFrame index values in the match results.
return_algorithms_score: bool, default=False: If True, return the full per-algorithm score matrix instead of just combined scores.
save_intermediate_results: bool, default=False: If True, saves intermediate pickle files for matching_data, to_be_matched, possible_matches.
load_intermediate_results: bool, default=False: If True, attempts to load intermediate pickle files before recomputing.
intermediate_results_name: dict of str to str, default={"matching_data": "df_matching_data_name", "to_be_matched": "to_be_matched_name", "possible_matches": "possible_matches_name"}: Filenames (without ".pkl") for saving/loading intermediate results.

Methods

`fuzzy_matches`(possible_matches, to_be_matched)	A method which performs the fuzzy matching between the data in the to_be_matched series as well as the indicated indexes of the matching_data points which are possible matching candidates.
`load_and_process_master_data`(column, ...[, ...])	Load the matching data into the NameMatcher and start the preprocessing.
`match_names`(to_be_matched, column_matching)	Match input names against the preprocessed master data.
`postprocess`(match)	Postprocesses the scores to exclude certain specific company words or the most common words.
`preprocess`(df, column_name[, original_name])	Preprocess a dataframe before applying a name matching algorithm.
`set_distance_metrics`(metrics)	A method to set which of the distance metrics should be employed during the fuzzy matching.
`transform_data`()	A method which transforms the matching data based on the ngrams transformer.
`unicode_to_ascii`(text)	Converts a string to ascii characters through transliteration.

Raises:

TypeError: If common_words is not a bool or iterable of strings.

__init__(ngrams: tuple = (2, 3), top_n: int = 50, low_memory: bool = False, number_of_rows: int = 5000, number_of_matches: int = 1, lowercase: bool = True, non_word_characters: bool | None = None, remove_ascii: bool = True, punctuations: bool | None = None, legal_suffixes: bool = False, preprocess_legal: bool = False, delete_legal: bool = False, make_abbreviations: bool = True, common_words: bool | list = False, cut_off_no_scoring_words: float = 0.01, preprocess_split: bool = False, begin_end_legal_pre_suffix: bool = True, verbose: bool = True, distance_metrics: list | tuple = ['overlap', 'weighted_jaccard', 'ratcliff_obershelp', 'fuzzy_wuzzy_token_sort', 'editex'], row_numbers: bool = False, return_algorithms_score: bool = False, save_intermediate_results: bool = False, load_intermediate_results: bool = False, intermediate_results_name: dict[str, str] = {'matching_data': 'df_matching_data_name', 'possible_matches': 'possible_matches_name', 'to_be_matched': 'to_be_matched_name'})¶

fuzzy_matches(possible_matches: array, to_be_matched: Series) → Series¶

A method which performs the fuzzy matching between the data in the to_be_matched series as well as the indicated indexes of the matching_data points which are possible matching candidates.

Parameters:

possible_matches: np.array: An array containing the indexes of the matching data with potential matches
to_be_matched: pd.Series: The data which should be matched

Returns:

pd.Series: A series containing the match index from the matching_data dataframe, the name in the to_be_matched data, the name to which the datapoint was matched and a score between 0 (no match) and 100 (perfect match) to indicate the quality of the matches.

load_and_process_master_data(column: str, df_matching_data: DataFrame, start_processing: bool = True, transform: bool = True) → None¶

Load the matching data into the NameMatcher and start the preprocessing.

Parameters:

column: str: The column name of the dataframe which should be used for the matching
df_matching_data: pd.DataFrame: The dataframe which is used to match the data to.
start_processing: bool: A boolean indicating whether to start the preprocessing step after loading the matching data. If transform is True the data will still be transformed and the preprocessing will be marked as completed. default: True
transform: bool: A boolean indicating whether or not the data should be transformed after the vectoriser is initialised. default: True

match_names(to_be_matched: Series | DataFrame, column_matching: str) → Series | DataFrame | Tuple[DataFrame, DataFrame]¶

Match input names against the preprocessed master data.

This will:

Preprocess the new names.
Compute cosine-similarity top-N candidates.
Apply fuzzy matching to those candidates.

Parameters:

to_be_matched: pandas Series or DataFrame: New names to match.
column_matching: str: Column name in to_be_matched containing the names.

Returns:

pandas Series or DataFrame

If return_algorithms_score=False and number_of_matches=1, returns a DataFrame containing the columns original_name, match_name, score and match_index.

If return_algorithms_score=False and number_of_matches>1, returns a DataFrame with columns for each alternative.

If return_algorithms_score=True, returns a tuple ( DataFrame_of_scores, DataFrame_of_matched_names ).
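
A typical call sequence, using only the signatures documented on this page, might look like the sketch below. `match_against_master` is a hypothetical wrapper, the chosen metrics are arbitrary, and the example has not been checked against any particular release of the package, so treat it as an outline rather than a tested recipe.

```python
def match_against_master(df_master, df_new, master_column, new_column):
    """Hypothetical helper: match names in df_new against names in df_master."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from name_matching.name_matcher import NameMatcher

    matcher = NameMatcher(number_of_matches=1, top_n=50, verbose=False)
    matcher.set_distance_metrics(["overlap", "weighted_jaccard", "editex"])

    # Preprocess and vectorise the data that the new names are matched against.
    matcher.load_and_process_master_data(
        column=master_column, df_matching_data=df_master, transform=True
    )

    # With number_of_matches=1 the documented result columns are
    # original_name, match_name, score and match_index.
    matches = matcher.match_names(to_be_matched=df_new, column_matching=new_column)

    # match_index refers to the index of df_master, so it can be used for a join.
    return matches.merge(df_master, how="left", left_on="match_index", right_index=True)
```

The join on match_index assumes that df_master keeps the same index it had when it was loaded into the matcher.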

postprocess(match: Series) → Series¶

Postprocesses the scores to exclude certain specific company words or the most common words. In this method only the scores are adjusted, the matches still stand.

Parameters:

match: pd.Series: The series with the possible matches and original scores

Returns:

pd.Series: A new version of the input series with updated scores
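
The idea of excluding very common words from scoring can be sketched as follows. This is an assumption-laden illustration: the rule "frequency above cut_off times the highest frequency" is taken from the description of cut_off_no_scoring_words, while `too_common_words` and `strip_words` are hypothetical helpers and the package's actual postprocessing may differ.

```python
from collections import Counter


def too_common_words(names, cut_off=0.01):
    """Words whose frequency exceeds cut_off times the highest word frequency."""
    counts = Counter(word for name in names for word in name.split())
    limit = cut_off * max(counts.values())
    return {word for word, count in counts.items() if count > limit}


def strip_words(name, words):
    """Remove the given words so they no longer contribute to a score."""
    return " ".join(w for w in name.split() if w not in words)


names = ["acme group", "globex group", "initech group", "umbrella holdings"]
common = too_common_words(names, cut_off=0.5)  # -> {"group"}
stripped = strip_words("acme group", common)   # -> "acme"
```

Re-scoring the stripped names leaves the chosen match in place and only changes its score, in line with the note that the matches still stand.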

preprocess(df: DataFrame, column_name: str, original_name: bool = False) → DataFrame¶

Preprocess a dataframe before applying a name matching algorithm. The preprocessing consists of removing special characters, spaces, converting all characters to lower case and removing the words given in the word lists

Parameters:

df: DataFrame: The dataframe or series on which the preprocessing needs to be performed
column_name: str: The name of the column that is used for the preprocessing
original_name: bool: If True, returns an additional column 'original_name' in the dataframe; this column holds the original, non-processed name. default=False

Returns:

pd.DataFrame: The preprocessed dataframe or series depending on the input

set_distance_metrics(metrics: List) → None¶

A method to set which of the distance metrics should be employed during the fuzzy matching. For very short explanations of most of the name matching algorithms, see the make_distance_metrics function in distance_metrics.py.

Parameters:

metrics: List: The list with the distance metrics to be used during the name matching. The distance metrics can be chosen from the list below:

indel, discounted_levenshtein, tichy, cormodel_z, iterative_sub_string, baulieu_xiii, clement, dice_asymmetrici, kuhns_iii, overlap, pearson_ii, weighted_jaccard, warrens_iv, bag, rouge_l, q_grams, ratcliff_obershelp, ncd_bz2, fuzzy_wuzzy_partial_string, fuzzy_wuzzy_token_sort, fuzzy_wuzzy_token_set, editex, typo, lig_3, ssk, refined_soundex, double_metaphone

transform_data() → None¶

A method which transforms the matching data based on the ngrams transformer. After the transformation (the generation of the ngrams), the data is normalised by dividing each row by the sum of the row. Subsequently the data is changed to a coo sparse matrix format with the column indices in ascending order.
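
The normalisation and the coordinate (COO) layout described here can be shown on a tiny dense example. This is a pure-Python sketch with hypothetical helper names; the package itself works on scipy sparse matrices.

```python
def normalise_rows(counts):
    """Divide every row by its own sum (all-zero rows stay all zero)."""
    normalised = []
    for row in counts:
        total = sum(row)
        normalised.append([v / total if total else 0.0 for v in row])
    return normalised


def to_coo(dense):
    """Return (rows, cols, values) of the non-zero entries.

    Entries are emitted row by row with ascending column indices.
    """
    rows, cols, values = [], [], []
    for r, row in enumerate(dense):
        for c, v in enumerate(row):
            if v:
                rows.append(r)
                cols.append(c)
                values.append(v)
    return rows, cols, values


ngram_counts = [[2, 0, 2], [0, 1, 3]]  # two names, three n-gram columns
rows, cols, values = to_coo(normalise_rows(ngram_counts))
# rows   -> [0, 0, 1, 1]
# cols   -> [0, 2, 1, 2]
# values -> [0.5, 0.5, 0.25, 0.75]
```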

unicode_to_ascii(text: str) → str¶

Converts a string to ascii characters through transliteration. The transliteration map is stored in the transliterations.py file in the data folder.

Parameters:

text: str: The text to be transliterated to ascii characters

Returns:

str: The processed text without any non-ascii characters
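
For comparison, a rough approximation of this behaviour is possible with the standard library. Note that this is a different technique from the one documented above: it relies on Unicode decomposition instead of a transliteration map, and `to_ascii` is a hypothetical name.

```python
import unicodedata


def to_ascii(text):
    """Strip accents by decomposing characters and dropping non-ASCII ones."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


result = to_ascii("Café Zürich")  # -> "Cafe Zurich"
```

Characters without a Unicode decomposition, such as "ø", are dropped by this approach rather than replaced, which is one reason a dedicated transliteration map gives better results.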

`distance_metrics`¶

A function which returns a dict containing the distance metrics that should be used during the fuzzy string matching

Levenshtein edit distance

Indel
Discounted Levenshtein
Levenshtein
LIG3
Jaro-Winkler

Block edit distances

Tichy
CormodeLZ

Multi-set token-based distance

BaulieuXIII
Clement
DiceAsymmetricI
KuhnsIII
Overlap
PearsonII
WeightedJaccard
WarrensIV
Bag
RougeL
Token distance

Subsequence distances

IterativeSubString
RatcliffObershelp
SSK

Normalized compression distance

NCDbz2

FuzzyWuzzy distances

FuzzyWuzzyPartialString
FuzzyWuzzyTokenSort
FuzzyWuzzyTokenSet

Phonetic distances

RefinedSoundex
DoubleMetaphone

Edit distances

Editex
Typo

Parameters:

indel: bool

Boolean indicating whether the Indel method should be used during the fuzzy name matching. The indel method is equal to a regular levenshtein distance with a twice as high substitution weight. If a dictionary is provided, it is used as parameters for the indel distance metric. default=False

discounted_levenshtein: bool

Boolean indicating whether the DiscountedLevenshtein method should be used during the fuzzy name matching. Equal to the regular Levenshtein distance, except that errors later in the string are counted at a discounted rate, which limits the importance of, for instance, suffix differences. If a dictionary is provided, it is used as parameters for the discounted_levenshtein distance metric. default=False

levenshtein: bool

Boolean indicating whether the Levenshtein method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the levenshtein distance metric. default=False

jaro_winkler: bool

Boolean indicating whether the JaroWinkler method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the jaro winkler distance metric. default=False

tichy: bool

Boolean indicating whether the Tichy method should be used during the fuzzy name matching. This algorithm provides a shortest edit distance based on substring and add operations. If a dictionary is provided, it is used as parameters for the tichy distance metric. default=False

cormodel_z: bool

Boolean indicating whether the CormodeLZ method should be used during the fuzzy name matching. The CormodeLZ distance between strings x and y, is the minimum number of single characters or substrings of y or of the partially built string which are required to produce x from left to right. If a dictionary is provided, it is used as parameters for the cormodel_z distance metric. default=False

iterative_sub_string: bool

Boolean indicating whether the IterativeSubString method should be used during the fuzzy name matching. A method that counts the similarities between two strings substrings and subtracts the differences taking into account the winkler similarity between the string and the substring. If a dictionary is provided, it is used as parameters for the iterative_sub_string distance metric. default=False

baulieu_xiii: bool

Boolean indicating whether the BaulieuXIII method should be used during the fuzzy name matching. The Baulieu XIII distance between two strings is given by the following formula: (|X \ Y| + |Y \ X|) / ( |X ∩ Y| + |X \ Y| + |Y \ X| + |X ∩ Y| ∙ (|X ∩ Y| - 4)^2). If a dictionary is provided, it is used as parameters for the baulieu_xiii distance metric. default=False

clement: bool

Boolean indicating whether the Clement method should be used during the fuzzy name matching. The Clement distance between two strings is given by the following formula: (|X ∩ Y|/|X|)*(1-|X|/|N|) + (|(N \ X) \ Y|/|N \ X|) * (1-|N \ X|/|N|). If a dictionary is provided, it is used as parameters for the clement distance metric. default=False

dice_asymmetrici: bool

Boolean indicating whether the DiceAsymmetricI method should be used during the fuzzy name matching. The Dice asymmetric similarity is given by |X ∩ Y|/|X|. If a dictionary is provided, it is used as parameters for the dice_asymmetrici distance metric. default=False

kuhns_iii: bool

Boolean indicating whether the KuhnsIII method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the kuhns_iii distance metric. default=False

overlap: bool

Boolean indicating whether the Overlap method should be used during the fuzzy name matching. The overlap distance is given by: |X ∩ Y|/min(|X|,|Y|). If a dictionary is provided, it is used as parameters for the overlap distance metric. default=True

pearson_ii: bool

Boolean indicating whether the PearsonII method should be used during the fuzzy name matching. This algorithm is based on the Phi coefficient or the mean square contingency. If a dictionary is provided, it is used as parameters for the pearson_ii distance metric. default=False

weighted_jaccard: bool

Boolean indicating whether the WeightedJaccard method should be used during the fuzzy name matching. This is the Jaccard distance, only using a weighting for the differences of 3. If a dictionary is provided, it is used as parameters for the weighted_jaccard distance metric. default=True

warrens_iv: bool

Boolean indicating whether the WarrensIV method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the warrens_iv distance metric. default=False

bag: bool

Boolean indicating whether the Bag method should be used during the fuzzy name matching. This is a simplification of the regular edit distance by using a similarity tree structure. If a dictionary is provided, it is used as parameters for the bag distance metric. default=False

rouge_l: bool

Boolean indicating whether the ROUGE-L method should be used during the fuzzy name matching. The ROUGE-L method is a measure based on the longest common subsequence between two strings. If a dictionary is provided, it is used as parameters for the rouge_l distance metric. default=False

token_distance: bool

Boolean indicating whether a token distance method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the tokenised distance metrics. default=False

ratcliff_obershelp: bool

Boolean indicating whether the RatcliffObershelp method should be used during the fuzzy name matching. This method finds the longest common substring and evaluates the longest common substrings to the right and the left of the original longest common substring. If a dictionary is provided, it is used as parameters for the ratcliff_obershelp distance metric. default=True

ncd_bz2: bool

Boolean indicating whether the NCDbz2 method should be used during the fuzzy name matching. Applies the Burrows-Wheeler transform to the strings and subsequently returns the normalised compression distance. If a dictionary is provided, it is used as parameters for the ncd_bz2 distance metric. default=False

fuzzy_wuzzy_partial_string: bool

Boolean indicating whether the FuzzyWuzzyPartialString method should be used during the fuzzy name matching. This method takes the length of the longest common substring and divides it by the minimum of the lengths of the two strings. If a dictionary is provided, it is used as parameters for the fuzzy_wuzzy_partial_string distance metric. default=False

fuzzy_wuzzy_token_sort: bool

Boolean indicating whether the FuzzyWuzzyTokenSort method should be used during the fuzzy name matching. This tokenizes the words in the string and sorts them, subsequently a hamming distance is calculated. If a dictionary is provided, it is used as parameters for the fuzzy_wuzzy_token_sort distance metric. default=True

fuzzy_wuzzy_token_set: bool

Boolean indicating whether the FuzzyWuzzyTokenSet method should be used during the fuzzy name matching. This method tokenizes the strings, finds the largest intersection of the two substrings and divides it by the length of the shortest string. If a dictionary is provided, it is used as parameters for the fuzzy_wuzzy_token_set distance metric. default=False

editex: bool

Boolean indicating whether the Editex method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the editex distance metric. default=True

typo: bool

Boolean indicating whether the Typo method should be used during the fuzzy name matching. The typo distance is calculated based on the distance on a keyboard between edits. If a dictionary is provided, it is used as parameters for the typo distance metric. default=False

lig_3: bool

Boolean indicating whether the LIG3 method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the lig_3 distance metric. default=False

ssk: bool

Boolean indicating whether the SSK method should be used during the fuzzy name matching. The ssk algorithm looks at the string kernel generated by all the possible different subsequences present between the two strings. If a dictionary is provided, it is used as parameters for the ssk distance metric. default=False

refined_soundex: bool

Boolean indicating whether the string should be represented by the RefinedSoundex phonetic translation and the Levenshtein distance of the translated strings should be included in the fuzzy matching process. If a dictionary is provided, it is used as parameters for the refined_soundex distance metric. default=False

double_metaphone: bool

Boolean indicating whether the string should be represented by the DoubleMetaphone phonetic translation and the Levenshtein distance of the translated strings should be included in the fuzzy matching process. If a dictionary is provided, it is used as parameters for the double_metaphone distance metric. default=False
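
Several of the metrics described in this parameter list are simple enough to sketch with the standard library. The functions below are illustrative approximations, not the implementations used by the package: they operate on character bigram sets or plain strings, `difflib` is used as a stand-in for Ratcliff/Obershelp and for the comparison inside the token-sort metric, and all function names are hypothetical.

```python
import bz2
from difflib import SequenceMatcher


def bigrams(text):
    """Set of character bigrams of a string."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def overlap(x, y):
    """|X ∩ Y| / min(|X|, |Y|) on character bigram sets."""
    a, b = bigrams(x), bigrams(y)
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0


def indel(x, y):
    """Edit distance in which a substitution costs 2 (a delete plus an insert)."""
    previous = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        current = [i]
        for j, cy in enumerate(y, start=1):
            substitution = previous[j - 1] + (0 if cx == cy else 2)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def ratcliff_obershelp(x, y):
    """difflib's ratio, which is closely related to Ratcliff/Obershelp matching."""
    return SequenceMatcher(None, x, y).ratio()


def token_sort(x, y):
    """Sort the words of both names before comparing them."""
    return ratcliff_obershelp(" ".join(sorted(x.split())), " ".join(sorted(y.split())))


def ncd_bz2(x, y):
    """Normalised compression distance with bz2 as the compressor."""
    cx, cy = len(bz2.compress(x.encode())), len(bz2.compress(y.encode()))
    cxy = len(bz2.compress((x + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)


overlap_score = overlap("night", "nacht")                      # -> 0.25
indel_distance = indel("kitten", "sitting")                    # -> 5
ro_score = ratcliff_obershelp("abcd", "bcde")                  # -> 0.75
ts_score = token_sort("bank of america", "america bank of")    # -> 1.0
```

The compression-based distance is only informative for longer inputs, because the fixed overhead of bz2 dominates the compressed size of short names.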

`sparse_cosine`¶

name_matching.sparse_cosine.sparse_cosine_top_n(matrix_a: csc_matrix | coo_matrix, matrix_b: csc_matrix, top_n: int, low_memory: bool, number_of_rows: int, verbose: bool)¶

Calculates the top_n cosine matches between matrix_a and matrix_b. Takes into account the amount of memory that should be used, based on the low_memory flag.

Parameters¶

matrix_a: csc_matrix: The largest sparse csc matrix which should be multiplied
matrix_b: csc_matrix: The smallest sparse csc matrix which should be multiplied
top_n: int: The best n matches that should be returned
low_memory: bool: A bool indicating whether the low memory sparse cosine approach should be used
number_of_rows: int: An int indicating the number of rows which should be processed at once when calculating the cosine similarity
verbose: bool: A boolean indicating whether the progress should be printed

Returns:

np.array: The indexes for the n best sparse cosine matches between matrix a and b
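
The top-n selection can be sketched without scipy by storing each sparse row as a dictionary. This is a conceptual illustration only: it assumes the rows are already normalised so that a dot product equals the cosine similarity, it ignores batching and the low-memory path, and the function names are hypothetical.

```python
import heapq


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {column: value} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(v * b[k] for k, v in a.items() if k in b)


def top_n_matches(rows_a, rows_b, top_n):
    """For every row of rows_a, the indexes of the top_n best rows of rows_b."""
    result = []
    for a in rows_a:
        # Negating the index makes ties resolve towards the lowest index.
        scores = [(sparse_dot(a, b), -j) for j, b in enumerate(rows_b)]
        result.append([-j for _, j in heapq.nlargest(top_n, scores)])
    return result


rows_a = [{0: 0.6, 1: 0.8}, {2: 1.0}]
rows_b = [{0: 1.0}, {1: 1.0}, {2: 1.0}, {0: 0.6, 1: 0.8}]
best = top_n_matches(rows_a, rows_b, top_n=2)  # -> [[3, 1], [2, 0]]
```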
