name_matching¶
run_nm¶
- name_matching.run_nm.match_names(data_first: DataFrame | Series, data_second: DataFrame | Series, column_first='', column_second='', group_column_first='', group_column_second='', case_sensitive=False, punctuation_sensitive=False, special_character_sensitive=False, threshold=95, **kwargs) DataFrame¶
Function which performs name matching. First, a simple merge on the data is performed to find the instances in which the names match perfectly. Subsequently, the remaining names are matched using the name matching algorithm as defined in name_matcher.
- Parameters:
- data_first: Union[pd.DataFrame, pd.Series]
The first dataframe or series used for the name matching
- data_second: Union[pd.DataFrame, pd.Series]
The second dataframe or series used for the name matching. To match the data to itself, data_second should be equal to data_first
- column_first: str
If data_first is a dataframe, column_first should be the column of data_first that contains the names to be matched. default=''
- column_second: str
If data_second is a dataframe, column_second should be the column of data_second that contains the names to be matched. default=''
- group_column_first: str
The name of the column that should be used to generate groups within the data_first dataframe. The matching is then only performed for instances in which the groups are identical. default=''
- group_column_second: str
The name of the column that should be used to generate groups within the data_second dataframe. The matching is then only performed for instances in which the groups are identical. default=''
- case_sensitive: bool
Boolean value indicating whether the matching is case sensitive. If False, all characters are converted to lowercase before the name matching starts. default=False
- punctuation_sensitive: bool
Boolean value indicating whether punctuation is kept in the names. If False, punctuation is removed from the original names before the name matching starts. default=False
- special_character_sensitive: bool
Boolean value indicating whether special characters are kept in the names. If False, the special characters are replaced before the name matching starts. default=False
- threshold: int
The minimal score a match should have to be part of the output. default=95
- **kwargs
Additional inputs for the name_matcher
- Returns:
- pd.DataFrame
A dataframe containing the matched rows where the match score is above the threshold. The dataframe consists of 4 columns: original_name, the original name from data_first after preprocessing; match_name_0, the name it is matched to from data_second after preprocessing; score_0, the score of the match; and match_index_0, the index of the match in data_second. The match_index_0 column can be used to join the data from both dataframes.
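The two-stage flow described above (exact matches first, then fuzzy matching filtered by a threshold) can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: `clean` and `match_exact_then_fuzzy` are hypothetical helpers, plain lists stand in for the pandas inputs, and `difflib.SequenceMatcher` stands in for the NameMatcher scoring.

```python
from difflib import SequenceMatcher


def clean(name):
    """Lowercase, drop punctuation, and collapse whitespace (assumed preprocessing)."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def match_exact_then_fuzzy(first, second, threshold=95):
    """Return one row per name in `first` whose best match scores at least `threshold`."""
    cleaned_second = [clean(name) for name in second]
    if not cleaned_second:
        return []
    exact_lookup = {}
    for index, name in enumerate(cleaned_second):
        exact_lookup.setdefault(name, index)  # keep the first occurrence

    rows = []
    for name in first:
        cleaned = clean(name)
        if cleaned in exact_lookup:
            # Stage 1: perfect match, no scoring needed.
            best_index, best_score = exact_lookup[cleaned], 100.0
        else:
            # Stage 2: fuzzy match; difflib is a stand-in for the real matcher.
            scores = [100 * SequenceMatcher(None, cleaned, other).ratio()
                      for other in cleaned_second]
            best_index = max(range(len(scores)), key=scores.__getitem__)
            best_score = scores[best_index]
        if best_score >= threshold:
            rows.append({
                "original_name": cleaned,
                "match_name_0": cleaned_second[best_index],
                "score_0": best_score,
                "match_index_0": best_index,
            })
    return rows
```

The `match_index_0` value in each row is the position in the second list, which mirrors how the documented column can be used to join the two inputs.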
name_matcher¶
- class name_matching.name_matcher.NameMatcher(ngrams: tuple = (2, 3), top_n: int = 50, low_memory: bool = False, number_of_rows: int = 5000, number_of_matches: int = 1, lowercase: bool = True, non_word_characters: bool | None = None, remove_ascii: bool = True, punctuations: bool | None = None, legal_suffixes: bool = False, preprocess_legal: bool = False, delete_legal: bool = False, make_abbreviations: bool = True, common_words: bool | list = False, cut_off_no_scoring_words: float = 0.01, preprocess_split: bool = False, begin_end_legal_pre_suffix: bool = True, verbose: bool = True, distance_metrics: list | tuple = ['overlap', 'weighted_jaccard', 'ratcliff_obershelp', 'fuzzy_wuzzy_token_sort', 'editex'], row_numbers: bool = False, return_algorithms_score: bool = False, save_intermediate_results: bool = False, load_intermediate_results: bool = False, intermediate_results_name: dict[str, str] = {'matching_data': 'df_matching_data_name', 'possible_matches': 'possible_matches_name', 'to_be_matched': 'to_be_matched_name'})¶
Bases: object
Name matching using character n-gram cosine similarity followed by fuzzy matching.
This class first vectorizes names using character n-grams (via TF-IDF and cosine similarity), selects the top N candidates, and then applies multiple fuzzy matching distance metrics to pick the best match or top-K matches.
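As a rough illustration of the candidate-selection idea described above, the sketch below builds TF-IDF weighted character n-gram vectors, ranks the master names by cosine similarity, and re-scores only the top candidates with a fuzzy metric. It uses only the standard library; the helper names, the smoothed IDF formula, and the use of `difflib.SequenceMatcher` as the single fuzzy metric are assumptions made for the example, not the class internals.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def char_ngrams(text, sizes=(2, 3)):
    """All character n-grams of the given lengths."""
    return [text[i:i + n] for n in sizes for i in range(len(text) - n + 1)]


def fit_idf(names, sizes=(2, 3)):
    """Smoothed inverse document frequency per n-gram, fitted on the master names."""
    doc_freq = Counter(g for name in names for g in set(char_ngrams(name, sizes)))
    total = len(names)
    return {g: math.log((1 + total) / (1 + df)) + 1 for g, df in doc_freq.items()}


def vectorize(name, idf, sizes=(2, 3)):
    """L2-normalised TF-IDF vector; n-grams unseen during fitting are ignored."""
    counts = Counter(char_ngrams(name, sizes))
    vector = {g: tf * idf[g] for g, tf in counts.items() if g in idf}
    norm = math.sqrt(sum(w * w for w in vector.values())) or 1.0
    return {g: w / norm for g, w in vector.items()}


def cosine(u, v):
    """Dot product of two normalised sparse vectors."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())


def best_match(query, master, top_n=50):
    """Return (index, score) of the best fuzzy match among the top_n cosine candidates."""
    idf = fit_idf(master)
    vectors = [vectorize(name, idf) for name in master]
    query_vector = vectorize(query, idf)
    ranked = sorted(range(len(master)),
                    key=lambda i: cosine(query_vector, vectors[i]), reverse=True)
    candidates = ranked[:top_n]  # cheap cosine step narrows the search
    scores = {i: 100 * SequenceMatcher(None, query, master[i]).ratio()
              for i in candidates}  # expensive fuzzy step on few candidates
    best = max(scores, key=scores.get)
    return best, scores[best]
```

The point of the two steps is cost: the cosine ranking is cheap and runs against every master name, while the slower fuzzy comparison only runs on `top_n` candidates.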
- Parameters:
- ngrams: tuple of int, default=(2, 3)
Character n-gram lengths used for cosine similarity.
- top_n: int, default=50
Number of top candidates to return from the cosine step.
- low_memory: bool, default=False
If True, uses a low-memory approach for sparse cosine similarity.
- number_of_rows: int, default=5000
Number of rows processed at once in the sparse cosine similarity step. Ignored if low_memory is True.
- number_of_matches: int, default=1
Number of fuzzy-matched alternatives to return.
- lowercase: bool, default=True
If True, converts text to lowercase during preprocessing.
- non_word_characters: Optional[bool], default=True
If True, strips non-word characters (excluding & and #) during preprocessing.
- remove_ascii: bool, default=True
If True, transliterates to ASCII (dropping accents) during preprocessing.
- punctuations: Optional[bool], default=None
Deprecated alias for non_word_characters.
- legal_suffixes: bool, default=False
If True, post-processing will ignore common company legal suffixes in scoring.
- preprocess_legal: bool, default=False
If True, strips or abbreviates legal suffixes/prefixes during preprocessing.
- delete_legal: bool, default=False
If True, deletes legal suffixes/prefixes instead of abbreviating them.
- make_abbreviations: bool, default=True
If True, replaces common words with their abbreviations during preprocessing.
- common_words: Union[bool, list], default=False
If True, will post-process to down-weight the most common words. If a list is provided, those specific words will be down-weighted.
- cut_off_no_scoring_words: float, default=0.01
Threshold (fraction of max frequency) above which a word is considered too common.
- preprocess_split: bool, default=False
If True, performs an additional “split” variant of preprocessing for searching.
- begin_end_legal_pre_suffix: bool, default=True
If True, only abbreviate legal terms at the beginning or end of names.
- verbose: bool, default=True
If True, prints progress via tqdm.
- distance_metrics: list of str, default=["overlap", "weighted_jaccard", "ratcliff_obershelp", "fuzzy_wuzzy_token_sort", "editex"]
List of distance metric names to use in the fuzzy-matching step.
- row_numbers: bool, default=False
If True, returns original DataFrame index values in the match results.
- return_algorithms_score: bool, default=False
If True, return the full per-algorithm score matrix instead of just combined scores.
- save_intermediate_results: bool, default=False
If True, saves intermediate pickle files for matching_data, to_be_matched, possible_matches.
- load_intermediate_results: bool, default=False
If True, attempts to load intermediate pickle files before recomputing.
- intermediate_results_name: dict of str to str, default={"matching_data": "df_matching_data_name", "to_be_matched": "to_be_matched_name", "possible_matches": "possible_matches_name"}
Filenames (without “.pkl”) for saving/loading intermediate results.
Methods
fuzzy_matches(possible_matches, to_be_matched): A method which performs the fuzzy matching between the data in the to_be_matched series and the matching_data entries at the indicated indexes, which are the possible matching candidates.
load_and_process_master_data(column, ...[, ...]): Load the matching data into the NameMatcher and start the preprocessing.
match_names(to_be_matched, column_matching): Match input names against the preprocessed master data.
postprocess(match): Postprocesses the scores to exclude certain specific company words or the most common words.
preprocess(df, column_name[, original_name]): Preprocess a dataframe before applying a name matching algorithm.
set_distance_metrics(metrics): A method to set which of the distance metrics should be employed during the fuzzy matching.
transform_data(): A method which transforms the matching data based on the ngrams transformer.
unicode_to_ascii(text): Converts a string to ascii characters through transliteration.
- Raises:
- TypeError
If common_words is not a bool or iterable of strings.
- __init__(ngrams: tuple = (2, 3), top_n: int = 50, low_memory: bool = False, number_of_rows: int = 5000, number_of_matches: int = 1, lowercase: bool = True, non_word_characters: bool | None = None, remove_ascii: bool = True, punctuations: bool | None = None, legal_suffixes: bool = False, preprocess_legal: bool = False, delete_legal: bool = False, make_abbreviations: bool = True, common_words: bool | list = False, cut_off_no_scoring_words: float = 0.01, preprocess_split: bool = False, begin_end_legal_pre_suffix: bool = True, verbose: bool = True, distance_metrics: list | tuple = ['overlap', 'weighted_jaccard', 'ratcliff_obershelp', 'fuzzy_wuzzy_token_sort', 'editex'], row_numbers: bool = False, return_algorithms_score: bool = False, save_intermediate_results: bool = False, load_intermediate_results: bool = False, intermediate_results_name: dict[str, str] = {'matching_data': 'df_matching_data_name', 'possible_matches': 'possible_matches_name', 'to_be_matched': 'to_be_matched_name'})¶
- fuzzy_matches(possible_matches: array, to_be_matched: Series) Series¶
A method which performs the fuzzy matching between the data in the to_be_matched series and the matching_data entries at the indicated indexes, which are the possible matching candidates.
- Parameters:
- possible_matches: np.array
An array containing the indexes of the matching data with potential matches
- to_be_matched: pd.Series
The data which should be matched
- Returns:
- pd.Series
A series containing the match index from the matching_data dataframe, the name in the to_be_matched data, the name to which the datapoint was matched, and a score between 0 (no match) and 100 (perfect match) indicating the quality of the match.
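To make the inputs and the 0 to 100 score concrete, here is a minimal sketch of scoring one name against a set of candidate indexes. The documentation does not state how the individual metric scores are combined, so the plain average used here is an assumption, as are the helper names and the two stand-in metrics built on `difflib`.

```python
from difflib import SequenceMatcher


def ratcliff_obershelp(a, b):
    """difflib's ratio follows the Ratcliff/Obershelp pattern-matching idea."""
    return SequenceMatcher(None, a, b).ratio()


def token_sort(a, b):
    """Compare the strings after sorting their whitespace-separated tokens."""
    return SequenceMatcher(None, " ".join(sorted(a.split())),
                           " ".join(sorted(b.split()))).ratio()


METRICS = (ratcliff_obershelp, token_sort)


def fuzzy_match(name, matching_data, possible_matches):
    """Score only the candidate indexes and return the best match as a dict."""
    scored = []
    for index in possible_matches:
        candidate = matching_data[index]
        # Assumed combination rule: unweighted mean of the metric similarities.
        score = 100 * sum(metric(name, candidate) for metric in METRICS) / len(METRICS)
        scored.append((score, index))
    score, index = max(scored)
    return {
        "match_index": index,
        "original_name": name,
        "match_name": matching_data[index],
        "score": score,
    }
```

Only the entries listed in `possible_matches` are ever compared, which is what keeps the fuzzy step affordable on large master datasets.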
- load_and_process_master_data(column: str, df_matching_data: DataFrame, start_processing: bool = True, transform: bool = True) None¶
Load the matching data into the NameMatcher and start the preprocessing.
- Parameters:
- column: str
The column name of the dataframe which should be used for the matching
- df_matching_data: pd.DataFrame
The dataframe which is used to match the data to.
- start_processing: bool
A boolean indicating whether to start the preprocessing step after loading the matching data. If transform is True the data will still be transformed and the preprocessing will be marked as completed. default=True
- transform: bool
A boolean indicating whether or not the data should be transformed after the vectoriser is initialised. default=True
- match_names(to_be_matched: Series | DataFrame, column_matching: str) Series | DataFrame | Tuple[DataFrame, DataFrame]¶
Match input names against the preprocessed master data.
- This will:
Preprocess the new names.
Compute cosine-similarity top-N candidates.
Apply fuzzy matching to those candidates.
- Parameters:
- to_be_matched: pandas Series or DataFrame
New names to match.
- column_matching: str
Column name in to_be_matched containing the names.
- Returns:
- pandas Series or DataFrame
If return_algorithms_score=False and number_of_matches=1, returns a DataFrame containing:
original_name
match_name
score
match_index
If return_algorithms_score=False and number_of_matches>1, returns a DataFrame with columns for each alternative.
If return_algorithms_score=True, returns a tuple ( DataFrame_of_scores, DataFrame_of_matched_names ).
- postprocess(match: Series) Series¶
Postprocesses the scores to exclude certain specific company words or the most common words. In this method only the scores are adjusted, the matches still stand.
- Parameters:
- match: pd.Series
The series with the possible matches and original scores
- Returns:
- pd.Series
A new version of the input series with updated scores
- preprocess(df: DataFrame, column_name: str, original_name: bool = False) DataFrame¶
Preprocess a dataframe before applying a name matching algorithm. The preprocessing consists of removing special characters, spaces, converting all characters to lower case and removing the words given in the word lists
- Parameters:
- df: DataFrame
The dataframe or series on which the preprocessing needs to be performed
- column_name: str
The name of the column that is used for the preprocessing
- original_name: bool
If True, returns an additional column 'original_name' in the dataframe. This column holds the original, non-processed name. default=False
- Returns:
- pd.DataFrame
The preprocessed dataframe or series depending on the input
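A simplified view of the preprocessing described above, written against a list of dicts rather than a DataFrame so that it runs without pandas. The function name `preprocess_rows` is hypothetical, and the regular expression (keep word characters, whitespace, `&` and `#`) is an assumption based on the non_word_characters parameter description; the word-list removal step is left out.

```python
import re


def preprocess_rows(rows, column_name, original_name=False):
    """Return cleaned copies of `rows`, optionally keeping the raw name."""
    cleaned_rows = []
    for row in rows:
        new_row = dict(row)  # leave the caller's data untouched
        if original_name:
            new_row["original_name"] = row[column_name]
        # Lowercase, then strip everything except word characters, whitespace, & and #.
        text = re.sub(r"[^\w\s&#]", "", row[column_name].lower())
        # Collapse repeated whitespace and trim the ends.
        new_row[column_name] = " ".join(text.split())
        cleaned_rows.append(new_row)
    return cleaned_rows
```

Keeping the raw value under `original_name` makes it possible to report matches in terms of the names as they appeared in the input.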
- set_distance_metrics(metrics: List) None¶
A method to set which of the distance metrics should be employed during the fuzzy matching. For very short explanations of most of the name matching algorithms, please see the make_distance_metrics function in distance_metrics.py
- Parameters:
- metrics: List
The list with the distance metrics to be used during the name matching. The distance metrics can be chosen from the list below:
indel, discounted_levenshtein, tichy, cormodel_z, iterative_sub_string, baulieu_xiii, clement, dice_asymmetrici, kuhns_iii, overlap, pearson_ii, weighted_jaccard, warrens_iv, bag, rouge_l, q_grams, ratcliff_obershelp, ncd_bz2, fuzzy_wuzzy_partial_string, fuzzy_wuzzy_token_sort, fuzzy_wuzzy_token_set, editex, typo, lig_3, ssk, refined_soundex, double_metaphone
- transform_data() None¶
A method which transforms the matching data based on the ngrams transformer. After the transformation (the generation of the ngrams), the data is normalised by dividing each row by the sum of the row. Subsequently the data is changed to a coo sparse matrix format with the column indices in ascending order.
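The three steps named above (n-gram generation, row normalisation by the row sum, and a COO layout with ascending column indices) can be sketched without SciPy. The function `ngram_coo` is a hypothetical helper that returns the COO triplets as plain lists; raw n-gram counts are used as the pre-normalisation values, which is an assumption about the transformer.

```python
from collections import Counter


def ngram_coo(names, sizes=(2, 3)):
    """Return (rows, cols, data, vocabulary) for row-normalised n-gram counts."""
    vocabulary = {}  # n-gram -> column index, assigned in order of first appearance
    rows, cols, data = [], [], []
    for row_index, name in enumerate(names):
        counts = Counter(name[i:i + n] for n in sizes
                         for i in range(len(name) - n + 1))
        total = sum(counts.values())
        # Divide each count by the row sum, then sort by column index.
        entries = sorted((vocabulary.setdefault(gram, len(vocabulary)), count / total)
                         for gram, count in counts.items())
        for col_index, value in entries:
            rows.append(row_index)
            cols.append(col_index)
            data.append(value)
    return rows, cols, data, vocabulary
```

With SciPy available, the three lists could be passed to `scipy.sparse.coo_matrix((data, (rows, cols)))` to obtain an actual sparse matrix.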
- unicode_to_ascii(text: str) str¶
Converts a string to ascii characters through transliteration. The transliteration map is stored in the transliterations.py file in the data folder.
- Parameters:
- text: str
The text to be transliterated to ascii characters
- Returns:
- str
The processed text without any non-ascii characters
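For readers who want the general idea without the package's transliteration map, the sketch below uses a different technique: Unicode NFKD decomposition from the standard library, plus a tiny hand-written map for characters that do not decompose. Both `to_ascii` and `EXTRA_MAP` are hypothetical names, and the output may differ from what unicode_to_ascii produces for some characters.

```python
import unicodedata

# Hypothetical fallback map for characters without a Unicode decomposition.
EXTRA_MAP = {"ß": "ss", "Ø": "O", "ø": "o", "Æ": "AE", "æ": "ae"}


def to_ascii(text):
    """Transliterate to ASCII: map special cases, decompose accents, drop the rest."""
    text = "".join(EXTRA_MAP.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFKD", text)  # "é" -> "e" + combining accent
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Characters that neither decompose nor appear in the fallback map are silently dropped, which is the main reason a full transliteration table gives better results.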
distance_metrics¶
- name_matching.distance_metrics.make_distance_metrics(indel: bool | dict = False, discounted_levenshtein: bool | dict = False, levenshtein: bool | dict = False, jaro_winkler: bool | dict = False, tichy: bool | dict = False, cormodel_z: bool | dict = False, iterative_sub_string: bool | dict = False, baulieu_xiii: bool | dict = False, clement: bool | dict = False, dice_asymmetrici: bool | dict = False, kuhns_iii: bool | dict = False, overlap: bool | dict = False, pearson_ii: bool | dict = False, weighted_jaccard: bool | dict = False, warrens_iv: bool | dict = False, bag: bool | dict = False, rouge_l: bool | dict = False, token_distance: bool | dict = False, ratcliff_obershelp: bool | dict = False, ncd_bz2: bool | dict = False, fuzzy_wuzzy_partial_string: bool | dict = False, fuzzy_wuzzy_token_sort: bool | dict = False, fuzzy_wuzzy_token_set: bool | dict = False, editex: bool | dict = False, typo: bool | dict = False, lig_3: bool | dict = False, ssk: bool | dict = False, refined_soundex: bool | dict = False, double_metaphone: bool | dict = False) dict¶
A function which returns a dict containing the distance metrics that should be used during the fuzzy string matching
- Levenshtein edit distance
Indel
Discounted Levenshtein
Levenshtein
LIG3
Jaro-Winkler
- Block edit distances
Tichy
CormodeLZ
- Multi-set token-based distance
BaulieuXIII
Clement
DiceAsymmetricI
KuhnsIII
Overlap
PearsonII
WeightedJaccard
WarrensIV
Bag
RougeL
Token distance
- Subsequence distances
IterativeSubString
RatcliffObershelp
SSK
- Normalized compression distance
NCDbz2
- FuzzyWuzzy distances
FuzzyWuzzyPartialString
FuzzyWuzzyTokenSort
FuzzyWuzzyTokenSet
- Phonetic distances
RefinedSoundex
DoubleMetaphone
- Edit distances
Editex
Typo
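The signature accepts either a bool or a dict for each metric. One way to read that contract is sketched below: False skips the metric, True enables it with defaults, and a dict enables it with those keyword parameters. The layout of the real return value is not documented here, so the flat name-to-callable dict, the `make_metrics` helper, and the two stand-in metrics are all assumptions.

```python
from difflib import SequenceMatcher
from functools import partial


def overlap(a, b, q=2):
    """Overlap coefficient on sets of character q-grams."""
    x = {a[i:i + q] for i in range(len(a) - q + 1)}
    y = {b[i:i + q] for i in range(len(b) - q + 1)}
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


def ratcliff_obershelp(a, b):
    """Ratcliff/Obershelp-style similarity via difflib."""
    return SequenceMatcher(None, a, b).ratio()


AVAILABLE = {"overlap": overlap, "ratcliff_obershelp": ratcliff_obershelp}


def make_metrics(**flags):
    """Map each enabled metric name to a ready-to-call function."""
    selected = {}
    for name, flag in flags.items():
        if name not in AVAILABLE:
            raise ValueError(f"unknown metric: {name}")
        if flag is False:
            continue  # metric switched off
        params = flag if isinstance(flag, dict) else {}
        selected[name] = partial(AVAILABLE[name], **params)
    return selected
```

For example, `make_metrics(overlap={"q": 3}, ratcliff_obershelp=True)` yields two callables, the first configured to use trigrams.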
- Parameters:
- indel: bool
Boolean indicating whether the Indel method should be used during the fuzzy name matching. The indel method is equal to a regular levenshtein distance with a twice as high substitution weight. If a dictionary is provided, it is used as parameters for the indel distance metric. default=False
- discounted_levenshtein: bool
Boolean indicating whether the DiscountedLevenshtein method should be used during the fuzzy name matching. Equal to the regular Levenshtein distance, except that errors later in the string are counted at a discounted rate, which limits the importance of, for instance, suffix differences. If a dictionary is provided, it is used as parameters for the discounted_levenshtein distance metric. default=False
- levenshtein: bool
Boolean indicating whether the Levenshtein method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the levenshtein distance metric. default=False
- jaro_winkler: bool
Boolean indicating whether the JaroWinkler method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the jaro winkler distance metric. default=False
- tichy: bool
Boolean indicating whether the Tichy method should be used during the fuzzy name matching. This algorithm provides a shortest edit distance based on substring and add operations. If a dictionary is provided, it is used as parameters for the tichy distance metric. default=False
- cormodel_z: bool
Boolean indicating whether the CormodeLZ method should be used during the fuzzy name matching. The CormodeLZ distance between strings x and y, is the minimum number of single characters or substrings of y or of the partially built string which are required to produce x from left to right. If a dictionary is provided, it is used as parameters for the cormodel_z distance metric. default=False
- iterative_sub_string: bool
Boolean indicating whether the IterativeSubString method should be used during the fuzzy name matching. A method that counts the similarities between the substrings of two strings and subtracts the differences, taking into account the Winkler similarity between the string and the substring. If a dictionary is provided, it is used as parameters for the iterative_sub_string distance metric. default=False
- baulieu_xiii: bool
Boolean indicating whether the BaulieuXIII method should be used during the fuzzy name matching. The Baulieu XIII distance between two strings is given by the following formula: (|X \ Y| + |Y \ X|) / ( |X ∩ Y| + |X \ Y| + |Y \ X| + |X ∩ Y| ∙ (|X ∩ Y| - 4)^2). If a dictionary is provided, it is used as parameters for the baulieu_xiii distance metric. default=False
- clement: bool
Boolean indicating whether the Clement method should be used during the fuzzy name matching. The Clement distance between two strings is given by the following formula: (|X ∩ Y|/|X|)*(1-|X|/|N|) + (|(N \ X) \ Y|/|N \ X|) * (1-|N \ X|/|N|). If a dictionary is provided, it is used as parameters for the clement distance metric. default=False
- dice_asymmetrici: bool
Boolean indicating whether the DiceAsymmetricI method should be used during the fuzzy name matching. The Dice asymmetric similarity is given by |X ∩ Y|/|X|. If a dictionary is provided, it is used as parameters for the dice_asymmetrici distance metric. default=False
- kuhns_iii: bool
Boolean indicating whether the KuhnsIII method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the kuhns_iii distance metric. default=False
- overlap: bool
Boolean indicating whether the Overlap method should be used during the fuzzy name matching. The overlap distance is given by: |X ∩ Y|/min(|X|,|Y|). If a dictionary is provided, it is used as parameters for the overlap distance metric. default=False (enabled in the NameMatcher default distance_metrics)
- pearson_ii: bool
Boolean indicating whether the PearsonII method should be used during the fuzzy name matching. This algorithm is based on the Phi coefficient or the mean square contingency. If a dictionary is provided, it is used as parameters for the pearson_ii distance metric. default=False
- weighted_jaccard: bool
Boolean indicating whether the WeightedJaccard method should be used during the fuzzy name matching. This is the Jaccard distance, but using a weighting of 3 for the differences. If a dictionary is provided, it is used as parameters for the weighted_jaccard distance metric. default=False (enabled in the NameMatcher default distance_metrics)
- warrens_iv: bool
Boolean indicating whether the WarrensIV method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the warrens_iv distance metric. default=False
- bag: bool
Boolean indicating whether the Bag method should be used during the fuzzy name matching. This is a simplification of the regular edit distance by using a similarity tree structure. If a dictionary is provided, it is used as parameters for the bag distance metric. default=False
- rouge_l: bool
Boolean indicating whether the ROUGE-L method should be used during the fuzzy name matching. The ROUGE-L method is a measure based on the longest common subsequence between two strings. If a dictionary is provided, it is used as parameters for the rouge_l distance metric. default=False
- token_distance: bool
Boolean indicating whether a token distance method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the tokenised distance metrics. default=False
- ratcliff_obershelp: bool
Boolean indicating whether the RatcliffObershelp method should be used during the fuzzy name matching. This method finds the longest common substring and then evaluates the longest common substrings to the right and the left of that substring. If a dictionary is provided, it is used as parameters for the ratcliff_obershelp distance metric. default=False (enabled in the NameMatcher default distance_metrics)
- ncd_bz2: bool
Boolean indicating whether the NCDbz2 method should be used during the fuzzy name matching. Applies the Burrows-Wheeler transform to the strings and subsequently returns the normalised compression distance. If a dictionary is provided, it is used as parameters for the ncd_bz2 distance metric. default=False
- fuzzy_wuzzy_partial_string: bool
Boolean indicating whether the FuzzyWuzzyPartialString method should be used during the fuzzy name matching. This method takes the length of the longest common substring and divides it by the length of the shorter of the two strings. If a dictionary is provided, it is used as parameters for the fuzzy_wuzzy_partial_string distance metric. default=False
- fuzzy_wuzzy_token_sort: bool
Boolean indicating whether the FuzzyWuzzyTokenSort method should be used during the fuzzy name matching. This tokenizes the words in the string and sorts them; subsequently a hamming distance is calculated. If a dictionary is provided, it is used as parameters for the fuzzy_wuzzy_token_sort distance metric. default=False (enabled in the NameMatcher default distance_metrics)
- fuzzy_wuzzy_token_set: bool
Boolean indicating whether the FuzzyWuzzyTokenSet method should be used during the fuzzy name matching. This method tokenizes the strings, finds the largest intersection of the two substrings, and divides it by the length of the shortest string. If a dictionary is provided, it is used as parameters for the fuzzy_wuzzy_token_set distance metric. default=False
- editex: bool
Boolean indicating whether the Editex method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the editex distance metric. default=False (enabled in the NameMatcher default distance_metrics)
- typo: bool
Boolean indicating whether the Typo method should be used during the fuzzy name matching. The typo distance is calculated based on the distance on a keyboard between edits. If a dictionary is provided, it is used as parameters for the typo distance metric. default=False
- lig_3: bool
Boolean indicating whether the LIG3 method should be used during the fuzzy name matching. If a dictionary is provided, it is used as parameters for the lig_3 distance metric. default=False
- ssk: bool
Boolean indicating whether the SSK method should be used during the fuzzy name matching. The ssk algorithm looks at the string kernel generated by all the possible different subsequences present between the two strings. If a dictionary is provided, it is used as parameters for the ssk distance metric. default=False
- refined_soundex: bool
Boolean indicating whether the string should be represented by the RefinedSoundex phonetic translation and the Levenshtein distance of the translated strings should be included in the fuzzy matching process. If a dictionary is provided, it is used as parameters for the refined_soundex distance metric. default=False
- double_metaphone: bool
Boolean indicating whether the string should be represented by the DoubleMetaphone phonetic translation and the Levenshtein distance of the translated strings should be included in the fuzzy matching process. If a dictionary is provided, it is used as parameters for the Double Metaphone distance metric. default=False
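A few of the formulas quoted in the parameter descriptions are small enough to write out directly. The sketch below covers the indel distance (Levenshtein with a substitution weight of 2), the overlap coefficient |X ∩ Y|/min(|X|,|Y|), and the Dice asymmetric similarity |X ∩ Y|/|X|. Tokenising into sets of character bigrams is an assumption made for this example; the underlying library may use multisets or a different tokeniser.

```python
def edit_distance(a, b, substitution_cost=1):
    """Levenshtein distance; substitution_cost=2 gives the indel distance."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            substitute = previous[j - 1] + (0 if ch_a == ch_b else substitution_cost)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitute))          # match or substitution
        previous = current
    return previous[-1]


def bigrams(text):
    """Set of character bigrams (assumed tokenisation)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def overlap(x, y):
    """|X ∩ Y| / min(|X|, |Y|)"""
    return len(x & y) / min(len(x), len(y))


def dice_asymmetric_i(x, y):
    """|X ∩ Y| / |X|"""
    return len(x & y) / len(x)
```

Note that the overlap coefficient is symmetric, while the Dice asymmetric similarity depends on which string is passed first: a short name fully contained in a long one scores 1.0 in one direction and much lower in the other.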
sparse_cosine¶
- name_matching.sparse_cosine.sparse_cosine_top_n(matrix_a: csc_matrix | coo_matrix, matrix_b: csc_matrix, top_n: int, low_memory: bool, number_of_rows: int, verbose: bool)¶
Calculates the top_n cosine matches between matrix_a and matrix_b. The amount of memory used depends on the low_memory flag.
- Parameters:
- matrix_a: csc_matrix
The largest sparse csc matrix which should be multiplied
- matrix_b: csc_matrix
The smallest sparse csc matrix which should be multiplied
- top_n: int
The best n matches that should be returned
- low_memory: bool
A bool indicating whether the low memory sparse cosine approach should be used
- number_of_rows: int
An int indicating the number of rows which should be processed at once when calculating the cosine similarity
- verbose: bool
A boolean indicating whether the progress should be printed
- Returns:
- np.array
The indexes for the n best sparse cosine matches between matrix a and b
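The idea of selecting the n best cosine matches in batches can be sketched with plain dictionaries standing in for sparse rows. This is only an illustration under stated assumptions: rows are assumed to be normalised already, the result is assumed to list, for each row of the smaller matrix, the indexes of the best rows in the larger one, and the names `sparse_dot` and `top_n_indexes` are hypothetical.

```python
import heapq


def sparse_dot(u, v):
    """Dot product of two sparse rows stored as {column: value} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter row
    return sum(value * v.get(col, 0.0) for col, value in u.items())


def top_n_indexes(rows_a, rows_b, top_n, number_of_rows=5000):
    """For each row in rows_b, return the indexes of its top_n matches in rows_a."""
    results = []
    # Work through rows_b in batches so only one batch of scores is alive at a time.
    for start in range(0, len(rows_b), number_of_rows):
        batch = rows_b[start:start + number_of_rows]
        for query in batch:
            scores = [sparse_dot(query, candidate) for candidate in rows_a]
            best = heapq.nlargest(top_n, range(len(rows_a)), key=scores.__getitem__)
            results.append(best)
    return results
```

Batching does not change the result, only how much intermediate data exists at once; in a real sparse implementation each batch would be a single matrix product.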