KeywordProcessor Class Doc

class pyflashtext.keyword.KeywordProcessor(case_sensitive=False)

Attributes:

_keyword (str): Key used to store keywords in the trie dictionary. Defaults to '_keyword_'.
non_word_boundaries (set(str)): Characters that indicate a word is continuing. Defaults to set([A-Za-z0-9_]).
keyword_trie_dict (dict): Trie dict, built character by character, that is used for lookup. Defaults to an empty dictionary.
case_sensitive (bool): Whether the search algorithm should be case sensitive. Defaults to False.
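
The attributes above describe a trie stored as nested dictionaries, with a marker key holding the clean name at the end of each keyword. The following is a minimal sketch of that data structure, not the library's actual code; the helper name `add_to_trie` and the constant `KEYWORD_KEY` are assumptions made for illustration.

```python
KEYWORD_KEY = "_keyword_"  # marker key named in the attribute list (assumed constant name)


def add_to_trie(trie, keyword, clean_name=None, case_sensitive=False):
    """Insert keyword into a nested-dict trie, one character per level."""
    # Fall back to the keyword itself when no clean name is given.
    clean = clean_name if clean_name is not None else keyword
    if not case_sensitive:
        keyword = keyword.lower()
    node = trie
    for char in keyword:
        node = node.setdefault(char, {})
    # The terminal node stores the clean name under the marker key.
    node[KEYWORD_KEY] = clean
    return trie


trie = {}
add_to_trie(trie, "NY", "new york")
add_to_trie(trie, "SF", "san francisco")
print(trie)
```

Keywords that share a prefix share the same outer dictionaries, which is what lets a lookup walk the text one character at a time.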

Examples:

>>> # import module
>>> from pyflashtext import KeywordProcessor
>>> # Create an object of KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> # add keywords
>>> keyword_names = ['NY', 'new-york', 'SF']
>>> clean_names = ['new york', 'new york', 'san francisco']
>>> for keyword_name, clean_name in zip(keyword_names, clean_names):
...     keyword_processor.add_keyword(keyword_name, clean_name)
>>> keywords_found = keyword_processor.extract_keywords('I love SF and NY. new-york is the best.')
>>> keywords_found
['san francisco', 'new york', 'new york']

Note:

Loosely based on the Aho-Corasick algorithm.
The idea came from this Stack Overflow question.

add_keyword(keyword, clean_name=None)

To add a keyword to the dictionary, pass the keyword and the clean name it maps to.

Args:

keyword (str): keyword that you want to identify
clean_name (str): clean term for that keyword that you want returned or used as the replacement. If not provided, the keyword is used as the clean name.

Returns:

status (bool): The return value. True for success, False otherwise.

Examples:

>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> # This case 'Big Apple' will return 'New York'
>>> # OR
>>> keyword_processor.add_keyword('Big Apple')
>>> # This case 'Big Apple' will return 'Big Apple'

add_keyword_from_file(keyword_file, encoding='utf-8')

To add keywords from a file

Args:

keyword_file: path to keywords file
encoding: specify the encoding of the file

Examples:

The keywords file can use either of these formats:

>>> # Option 1: keywords.txt content
>>> # java_2e=>java
>>> # java programing=>java
>>> # product management=>product management
>>> # product management techniques=>product management

>>> # Option 2: keywords.txt content
>>> # java
>>> # python
>>> # c++

>>> keyword_processor.add_keyword_from_file('keywords.txt')

Raises:

IOError: If keyword_file path is not valid
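
The two file formats shown above can be read with a few lines of plain Python. This is a sketch of one way to parse them, assuming that `=>` separates a keyword from its clean name and that a line without `=>` is a bare keyword; the helper `parse_keyword_lines` is hypothetical and not part of the library.

```python
import os
import tempfile


def parse_keyword_lines(lines):
    """Yield (keyword, clean_name) pairs; clean_name is None for bare keywords."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if "=>" in line:
            keyword, clean_name = line.split("=>", 1)
            yield keyword.strip(), clean_name.strip()
        else:
            yield line, None


# Write a small sample file, then read it back with the chosen encoding.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "keywords.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("java_2e=>java\njava programing=>java\nc++\n")
    with open(path, encoding="utf-8") as handle:
        pairs = list(parse_keyword_lines(handle))

print(pairs)
```

Each resulting pair could then be passed to `add_keyword(keyword, clean_name)`.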

add_keywords_from_dict(keyword_dict)

To add keywords from a dictionary

Args:

keyword_dict (dict): A dictionary with str key and (list str) as value

Examples:

>>> keyword_dict = {
...     "java": ["java_2e", "java programing"],
...     "product management": ["PM", "product manager"]
... }
>>> keyword_processor.add_keywords_from_dict(keyword_dict)

Raises:

AttributeError: If value for a key in keyword_dict is not a list.
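
In the dictionary format above, each key is a clean name and each value is the list of keywords that map to it. The sketch below shows how such a dictionary could be flattened into (keyword, clean name) pairs, raising `AttributeError` for non-list values as the Raises section describes. The function `iter_keyword_pairs` is a hypothetical helper, and the error message is an assumption.

```python
def iter_keyword_pairs(keyword_dict):
    """Yield (keyword, clean_name) for every keyword listed under a clean name."""
    for clean_name, keywords in keyword_dict.items():
        if not isinstance(keywords, list):
            # Mirrors the documented behaviour: values must be lists.
            raise AttributeError(f"Value of key {clean_name!r} should be a list")
        for keyword in keywords:
            yield keyword, clean_name


keyword_dict = {
    "java": ["java_2e", "java programing"],
    "product management": ["PM", "product manager"],
}
pairs = list(iter_keyword_pairs(keyword_dict))
print(pairs)
```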

add_keywords_from_list(keyword_list)

To add keywords from a list

Args:

keyword_list (list(str)): List of keywords to add

Examples:

>>> keyword_processor.add_keywords_from_list(["java", "python"])

Raises:

AttributeError: If keyword_list is not a list.

add_non_word_boundary(character)

Add a character that will be considered part of a word.

Args:

character (char): Character that will be considered part of a word.

extract_keywords(sentence, span_info=False, max_cost=0)

Searches in the string for all keywords present in corpus. Keywords present are added to a list keywords_extracted and returned.

Args:

sentence (str): Line of text where we will search for keywords
span_info (bool): True if you need the span boundaries where each extraction was performed
max_cost (int): maximum Levenshtein distance to accept when extracting keywords

Returns:

keywords_extracted (list(str)): List of terms/keywords found in sentence that match our corpus

Examples:

>>> from pyflashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
>>> keywords_found
['New York', 'Bay Area']
>>> keywords_found = keyword_processor.extract_keywords('I love Big Aple and Baay Area.', max_cost=1)
>>> keywords_found
['New York', 'Bay Area']
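
To make the search concrete, here is a simplified sketch of exact-match extraction over a nested-dict trie with whole-word boundaries. It is an illustration under stated assumptions, not the library's implementation: the functions `build_trie` and `extract`, and the `(clean_name, start, end)` tuple shape, are chosen for this example, and fuzzy matching via `max_cost` is not covered.

```python
import string

KEYWORD_KEY = "_keyword_"
WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def build_trie(pairs):
    """Build a lowercased nested-dict trie from (keyword, clean_name) pairs."""
    trie = {}
    for keyword, clean_name in pairs:
        node = trie
        for char in keyword.lower():
            node = node.setdefault(char, {})
        node[KEYWORD_KEY] = clean_name
    return trie


def extract(sentence, trie):
    """Return (clean_name, start, end) for each longest whole-word match."""
    text = sentence.lower()
    found = []
    i = 0
    while i < len(text):
        # A match may only start at the beginning of a word.
        if i > 0 and text[i - 1] in WORD_CHARS:
            i += 1
            continue
        node, best = trie, None
        j = i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            # Record a match only if it also ends on a word boundary.
            if KEYWORD_KEY in node and (j == len(text) or text[j] not in WORD_CHARS):
                best = (node[KEYWORD_KEY], i, j)
        if best:
            found.append(best)
            i = best[2]  # continue after the matched span
        else:
            i += 1
    return found


sentence = "I love Big Apple and Bay Area."
trie = build_trie([("Big Apple", "New York"), ("Bay Area", "Bay Area")])
matches = extract(sentence, trie)
print([name for name, _, _ in matches])
```

Dropping the start and end offsets from each tuple gives the plain list of clean names that the Returns section describes.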

get_all_keywords(term_so_far='', current_dict=None)

Recursively builds a dictionary of the keywords present in the trie and the clean names mapped to them.

Args:

term_so_far (str): term built so far by adding all previous characters
current_dict (dict): current recursive position in dictionary

Returns:

terms_present (dict): A map in which each key is a term in keyword_trie_dict and each value is the clean name mapped to that term.

Examples:

>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('j2ee', 'Java')
>>> keyword_processor.add_keyword('Python', 'Python')
>>> keyword_processor.get_all_keywords()
{'j2ee': 'Java', 'python': 'Python'}
>>> # NOTE: when case_sensitive is False, all keys are lowercased.
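
The recursion described above can be sketched as a depth-first walk over the nested dictionaries, carrying the characters seen so far. The function `collect_keywords` below is a hypothetical stand-in written for this example; it assumes the `_keyword_` marker key described in the class attributes.

```python
KEYWORD_KEY = "_keyword_"


def collect_keywords(node, term_so_far=""):
    """Walk the trie depth-first and map each complete term to its clean name."""
    terms = {}
    for key, child in node.items():
        if key == KEYWORD_KEY:
            # Reaching the marker means term_so_far is a complete keyword.
            terms[term_so_far] = child
        else:
            terms.update(collect_keywords(child, term_so_far + key))
    return terms


# Build a small trie by hand for the demonstration.
trie = {}
for keyword, clean_name in [("j2ee", "Java"), ("python", "Python")]:
    node = trie
    for char in keyword:
        node = node.setdefault(char, {})
    node[KEYWORD_KEY] = clean_name

all_keywords = collect_keywords(trie)
print(all_keywords)
```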

get_keyword(word)

If word is present in keyword_trie_dict, return the clean name for it.

Args:

word (str): word that you want to check

Returns:

keyword (str): If word is present as it is in keyword_trie_dict, the clean name mapped to it is returned.

Examples:

>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.get_keyword('Big Apple')
>>> # New York

get_next_word(sentence)

Retrieve the next word in the sequence. Iterates over the string until it finds the first character that is not in non_word_boundaries.

Args:

sentence (str): Line of text where we will look for the next word

Returns:

next_word (str): The next word in the sentence

Examples:

>>> from pyflashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.get_next_word('Big Apple')
'Big'
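
The scan described above is short enough to write out. This is a minimal sketch assuming the default non-word-boundary set `[A-Za-z0-9_]`; the standalone function `next_word` is an illustrative name, not the library method. It also shows the effect of treating an extra character such as `-` as part of a word.

```python
import string

# Default set of characters considered part of a word (assumed from the attribute list).
NON_WORD_BOUNDARIES = set(string.ascii_letters + string.digits + "_")


def next_word(sentence, non_word_boundaries=NON_WORD_BOUNDARIES):
    """Collect characters until the first one that is not part of a word."""
    word = ""
    for char in sentence:
        if char not in non_word_boundaries:
            break
        word += char
    return word


print(next_word("Big Apple"))
print(next_word("new-york"))
# Adding '-' to the set makes the hyphenated form a single word.
print(next_word("new-york", NON_WORD_BOUNDARIES | {"-"}))
```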

levensthein(word, max_cost=2, start_node=None)

Retrieve the nodes where there is a fuzzy match, measured by Levenshtein distance and limited by max_cost.

Args:

word (str): word to find a fuzzy match for
max_cost (int): maximum Levenshtein distance when performing the fuzzy match
start_node (dict): Trie node from which the search is performed

Yields:

node, cost, depth (tuple): A tuple containing the final node, the cost (i.e. the distance), and the depth in the trie

Examples:

>>> from pyflashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor(case_sensitive=True)
>>> keyword_processor.add_keyword('Marie', 'Mary')
>>> next(keyword_processor.levensthein('Maria', max_cost=1))
({'_keyword_': 'Mary'}, 1, 5)
...
>>> keyword_processor = KeywordProcessor(case_sensitive=True)
>>> keyword_processor.add_keyword('Marie Blanc', 'Mary')
>>> next(keyword_processor.levensthein('Mari', max_cost=1))
({' ': {'B': {'l': {'a': {'n': {'c': {'_keyword_': 'Mary'}}}}}}}, 1, 5)
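
For readers unfamiliar with the cost being reported, the sketch below computes the plain Levenshtein distance between two strings using the standard row-by-row dynamic program. It is a reference for the metric only and does not reproduce the trie traversal above; `levenshtein_distance` is a name chosen for this example.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1]


print(levenshtein_distance("Maria", "Marie"))
print(levenshtein_distance("Big Aple", "Big Apple"))
```

A `max_cost` of 1 therefore accepts 'Maria' for 'Marie' (one substitution) and 'Big Aple' for 'Big Apple' (one insertion).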

remove_keyword(keyword)

To remove a keyword from the dictionary, pass the keyword.

Args:

keyword (str): keyword that you want to remove if it’s present

Returns:

status (bool): The return value. True for success, False otherwise.

Examples:

>>> keyword_processor.add_keyword('Big Apple')
>>> keyword_processor.remove_keyword('Big Apple')
>>> # Returns True
>>> # This case 'Big Apple' will no longer be a recognized keyword
>>> keyword_processor.remove_keyword('Big Apple')
>>> # Returns False
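
Removal from a nested-dict trie can be sketched as follows: walk down to the keyword's terminal node, delete the marker, then prune any branches that no longer lead to a keyword. This is an assumed approach for illustration (the helper `remove_from_trie` is hypothetical), returning True or False in the same way as the example above.

```python
KEYWORD_KEY = "_keyword_"


def remove_from_trie(trie, keyword):
    """Remove keyword; return True if it was present, False otherwise."""
    path = []  # (parent_node, char) pairs along the keyword
    node = trie
    for char in keyword:
        if char not in node:
            return False
        path.append((node, char))
        node = node[char]
    if KEYWORD_KEY not in node:
        return False
    del node[KEYWORD_KEY]
    # Prune branches that no longer lead to any keyword.
    for parent, char in reversed(path):
        if parent[char]:
            break  # this branch is still used by another keyword
        del parent[char]
    return True


trie = {"a": {"b": {KEYWORD_KEY: "ab", "c": {KEYWORD_KEY: "abc"}}}}
first = remove_from_trie(trie, "abc")
second = remove_from_trie(trie, "abc")
print(first, second, trie)
```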

remove_keywords_from_dict(keyword_dict)

To remove keywords from a dictionary

Args:

keyword_dict (dict): A dictionary with str key and (list str) as value

Examples:

>>> keyword_dict = {
...     "java": ["java_2e", "java programing"],
...     "product management": ["PM", "product manager"]
... }
>>> keyword_processor.remove_keywords_from_dict(keyword_dict)

Raises:

AttributeError: If value for a key in keyword_dict is not a list.

remove_keywords_from_list(keyword_list)

To remove keywords present in list

Args:

keyword_list (list(str)): List of keywords to remove

Examples:

>>> keyword_processor.remove_keywords_from_list(["java", "python"])

Raises:

AttributeError: If keyword_list is not a list.

replace_keywords(sentence, max_cost=0)

Searches in the string for all keywords present in corpus. Keywords present are replaced by the clean name and a new string is returned.

Args:

sentence (str): Line of text where we will replace keywords
max_cost (int): maximum Levenshtein distance to accept when matching keywords

Returns:

new_sentence (str): Line of text with replaced keywords

Examples:

>>> from pyflashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> new_sentence = keyword_processor.replace_keywords('I love Big Apple and bay area.')
>>> new_sentence
'I love New York and Bay Area.'
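
Once the matched spans are known, building the new string is a matter of stitching the untouched text between spans together with the clean names. The sketch below assumes the spans are already available as `(clean_name, start, end)` tuples; both that tuple shape and the helper `replace_spans` are assumptions for this example, and the offsets are written by hand rather than produced by the library.

```python
def replace_spans(sentence, spans):
    """Rebuild sentence with each (clean_name, start, end) span replaced."""
    pieces = []
    cursor = 0
    for clean_name, start, end in sorted(spans, key=lambda span: span[1]):
        pieces.append(sentence[cursor:start])  # text before the match, unchanged
        pieces.append(clean_name)              # replacement for the match
        cursor = end
    pieces.append(sentence[cursor:])           # trailing text after the last match
    return "".join(pieces)


sentence = "I love Big Apple and bay area."
# Hand-written spans for 'Big Apple' and 'bay area' in the sentence above.
spans = [("New York", 7, 16), ("Bay Area", 21, 29)]
new_sentence = replace_spans(sentence, spans)
print(new_sentence)
```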

set_non_word_boundaries(non_word_boundaries)

Set the characters that will be considered part of a word.

Args:

non_word_boundaries (set(str)): Set of characters that will be considered part of a word.
