Data Processing
-
class dataprofiler.labelers.data_processing.AutoSubRegistrationMeta(clsname, bases, attrs)
Bases: abc.ABCMeta
-
mro()¶ Return a type’s method resolution order.
-
register(subclass)¶ Register a virtual subclass of an ABC.
Returns the subclass, to allow usage as a class decorator.
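The decorator behaviour of register() is inherited from the standard library's abc.ABCMeta, so it can be illustrated without the library. In this minimal sketch, Processor and ExternalProcessor are hypothetical names used only for illustration, not dataprofiler classes.

```python
from abc import ABC


class Processor(ABC):
    """Hypothetical abstract base, used only for this illustration."""


@Processor.register  # register() returns the class, so it works as a decorator
class ExternalProcessor:
    """Does not inherit from Processor; it is registered as a virtual subclass."""


# Virtual subclasses pass issubclass/isinstance checks...
is_virtual_subclass = issubclass(ExternalProcessor, Processor)
# ...but the ABC does not appear in their method resolution order.
in_mro = Processor in ExternalProcessor.__mro__
```

Because the ABC is absent from the MRO, a virtual subclass inherits no methods from it; registration only affects type checks.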
-
-
class dataprofiler.labelers.data_processing.BaseDataProcessor(**parameters)
Bases: object
Abstract data processing class.
-
processor_type= None¶
-
classmethod
get_class(class_name)¶
-
abstract classmethod
help()¶ Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
-
get_parameters(param_list=None)¶ Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
-
set_params(**kwargs)¶ Given kwargs, set the parameters if they exist.
-
abstract
process(*args)¶ Data processing function.
-
classmethod
load_from_disk(dirpath)¶ Loads a data processor from a given path on disk
-
classmethod
load_from_library(name)¶ Loads a data processor from within the library
-
save_to_disk(dirpath)¶ Saves a data processor to a path on disk.
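The parameter and persistence methods above can be pictured with a small stand-in class. This is a sketch under stated assumptions, not the library's implementation: SketchProcessor, the JSON storage format, the file name sketch_parameters.json, and the choice to raise on unknown parameter names are all hypothetical.

```python
import json
import os
import tempfile


class SketchProcessor:
    """Hypothetical stand-in; not the dataprofiler implementation."""

    processor_type = "preprocessor"

    def __init__(self, **parameters):
        self._parameters = dict(parameters)

    def get_parameters(self, param_list=None):
        # With no list given, return a copy of every parameter.
        if param_list is None:
            return dict(self._parameters)
        return {name: self._parameters[name] for name in param_list}

    def set_params(self, **kwargs):
        # Assumption: only parameters that already exist may be changed.
        for name in kwargs:
            if name not in self._parameters:
                raise ValueError(f"{name} is not an accepted parameter.")
        self._parameters.update(kwargs)

    def save_to_disk(self, dirpath):
        # Hypothetical file name and format (JSON).
        path = os.path.join(dirpath, "sketch_parameters.json")
        with open(path, "w") as fp:
            json.dump(self._parameters, fp)

    @classmethod
    def load_from_disk(cls, dirpath):
        path = os.path.join(dirpath, "sketch_parameters.json")
        with open(path) as fp:
            return cls(**json.load(fp))


# Round trip: change a parameter, save, and restore into a new instance.
with tempfile.TemporaryDirectory() as dirpath:
    original = SketchProcessor(max_length=3400, pad_label="PAD")
    original.set_params(max_length=100)
    original.save_to_disk(dirpath)
    restored = SketchProcessor.load_from_disk(dirpath)
```

Storing only the constructor parameters keeps the saved state small and lets load_from_disk rebuild the object through the normal constructor.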
-
-
class dataprofiler.labelers.data_processing.BaseDataPreprocessor(**parameters)
Bases: dataprofiler.labelers.data_processing.BaseDataProcessor
Abstract data preprocessing class.
-
processor_type= 'preprocessor'¶
-
abstract
process(data, labels, label_mapping, batch_size)¶ Data preprocessing function.
-
classmethod
get_class(class_name)¶
-
get_parameters(param_list=None)¶ Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
-
abstract classmethod
help()¶ Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
-
classmethod
load_from_disk(dirpath)¶ Loads a data processor from a given path on disk
-
classmethod
load_from_library(name)¶ Loads a data processor from within the library
-
save_to_disk(dirpath)¶ Saves a data processor to a path on disk.
-
set_params(**kwargs)¶ Given kwargs, set the parameters if they exist.
-
-
class dataprofiler.labelers.data_processing.BaseDataPostprocessor(**parameters)
Bases: dataprofiler.labelers.data_processing.BaseDataProcessor
Abstract data postprocessing class.
-
processor_type= 'postprocessor'¶
-
abstract
process(data, results, label_mapping)¶ Data postprocessing function.
-
classmethod
get_class(class_name)¶
-
get_parameters(param_list=None)¶ Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
-
abstract classmethod
help()¶ Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
-
classmethod
load_from_disk(dirpath)¶ Loads a data processor from a given path on disk
-
classmethod
load_from_library(name)¶ Loads a data processor from within the library
-
save_to_disk(dirpath)¶ Saves a data processor to a path on disk.
-
set_params(**kwargs)¶ Given kwargs, set the parameters if they exist.
-
-
class dataprofiler.labelers.data_processing.DirectPassPreprocessor
Bases: dataprofiler.labelers.data_processing.BaseDataPreprocessor
Initialize the DirectPassPreprocessor class.
-
classmethod
help()¶ Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
-
process(data, labels=None, label_mapping=None, batch_size=None)¶ Data preprocessing function.
-
classmethod
get_class(class_name)¶
-
get_parameters(param_list=None)¶ Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
-
classmethod
load_from_disk(dirpath)¶ Loads a data processor from a given path on disk
-
classmethod
load_from_library(name)¶ Loads a data processor from within the library
-
processor_type= 'preprocessor'¶
-
save_to_disk(dirpath)¶ Saves a data processor to a path on disk.
-
set_params(**kwargs)¶ Given kwargs, set the parameters if they exist.
-
-
class dataprofiler.labelers.data_processing.CharPreprocessor(max_length=3400, default_label='UNKNOWN', pad_label='PAD', flatten_split=0, flatten_separator=' ', is_separate_at_max_len=False)
Bases: dataprofiler.labelers.data_processing.BaseDataPreprocessor
Initialize the CharPreprocessor class.
- Parameters
max_length (int) – Maximum char length in a sample.
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_split (float) – approximate proportion of flattened to non-flattened characters in the output, a value in [0, 1]. Once the running flattened proportion exceeds flatten_split, any leftover sample and all subsequent samples are left non-flattened until the running proportion drops below flatten_split again.
flatten_separator (str) – separator used to put between flattened samples.
is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator
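The effect of max_length and is_separate_at_max_len can be sketched as follows. This is an illustration only: split_sample is a hypothetical helper, and treating "nearest separator" as the last separator inside the current window is an assumption, not the library's documented behaviour.

```python
def split_sample(text, max_length, separator=" ", is_separate_at_max_len=False):
    """Split text into chunks of at most max_length characters."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_length, len(text))
        if not is_separate_at_max_len and end < len(text):
            # Assumption: back up to the last separator inside the window,
            # keeping the separator at the end of the current chunk.
            cut = text.rfind(separator, start, end)
            if cut > start:
                end = cut + len(separator)
        chunks.append(text[start:end])
        start = end
    return chunks


text = "alpha beta gamma"
hard_chunks = split_sample(text, 8, is_separate_at_max_len=True)
soft_chunks = split_sample(text, 8, is_separate_at_max_len=False)
```

In both modes the chunks concatenate back to the original text; the difference is only whether words may be cut in the middle.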
-
classmethod
help()¶ Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
-
process(data, labels=None, label_mapping=None, batch_size=32)¶ Flatten batches of data
- Parameters
data (numpy.ndarray) – List of strings to create embeddings for
labels (numpy.ndarray) – labels for each input character
label_mapping (Union[None, dict]) – maps labels to their encoded integers
batch_size (int) – Number of samples in the batch of data
- Returns
batch_data: a dict containing samples of size batch_size
- Return type
dict
-
classmethod
get_class(class_name)¶
-
get_parameters(param_list=None)¶ Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
-
classmethod
load_from_disk(dirpath)¶ Loads a data processor from a given path on disk
-
classmethod
load_from_library(name)¶ Loads a data processor from within the library
-
processor_type= 'preprocessor'¶
-
save_to_disk(dirpath)¶ Saves a data processor to a path on disk.
-
set_params(**kwargs)¶ Given kwargs, set the parameters if they exist.
-
class dataprofiler.labelers.data_processing.CharPostprocessor(default_label='UNKNOWN', pad_label='PAD', flatten_separator=' ', use_word_level_argmax=False, output_format='character_argmax', separators=(' ', ',', ';', "'", '"', ':', '\n', '\t', '.'), word_level_min_percent=0.75)
Bases: dataprofiler.labelers.data_processing.BaseDataPostprocessor
Initialize the CharPostprocessor class.
- Parameters
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
use_word_level_argmax (bool) – whether to require the argmax value of each character in a word to determine the word’s entity
output_format (str) – either character_argmax or NER. character_argmax is a list of encodings, one per character of the input text; NER is a dict format that specifies start, end, and label for each entity in a sentence
separators (tuple(str)) – list of characters to use for separating words within the character predictions
word_level_min_percent (float) – threshold on generating dominant word_level labeling
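One way to picture how separators and word_level_min_percent could interact is the sketch below. It is not the library's algorithm: word_level_argmax is a hypothetical helper, and falling back to the default label when no label reaches the threshold is an assumption made for illustration.

```python
from collections import Counter


def word_level_argmax(text, char_labels, separators, default_label, min_percent):
    """Give every character of a word the word's dominant label."""
    labels = list(char_labels)
    start = 0
    for idx in range(len(text) + 1):
        # A word ends at a separator character or at the end of the text.
        if idx == len(text) or text[idx] in separators:
            if idx > start:
                label, count = Counter(labels[start:idx]).most_common(1)[0]
                # Assumption: below the threshold, use the default label.
                if count / (idx - start) < min_percent:
                    label = default_label
                labels[start:idx] = [label] * (idx - start)
            start = idx + 1
    return labels


# "ab" is unanimously label 1; "cd" is split between labels 1 and 2.
smoothed = word_level_argmax("ab cd", [1, 1, 0, 1, 2], (" ",), 0, 0.75)
```

Separator characters keep whatever label they already had; only the characters inside each word are rewritten.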
-
classmethod
help()¶ Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
-
static
convert_to_NER_format(predictions, label_mapping, default_label, pad_label)¶ Converts word level predictions to specified format
- Parameters
predictions (list) – predictions
label_mapping (dict) – labels and corresponding integers
default_label (str) – default label in label_mapping
pad_label (str) – pad label in label_mapping
- Returns
formatted predictions
- Return type
list
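The conversion from per-character encodings to entity spans can be sketched like this. The helper to_ner_format is hypothetical, and the (start, end, label) tuple layout with an exclusive end index is an assumption; the library's exact output structure is not specified here.

```python
def to_ner_format(char_predictions, label_mapping, default_label, pad_label):
    """Collapse runs of equal encodings into (start, end, label) spans."""
    reverse = {value: key for key, value in label_mapping.items()}
    skipped = {label_mapping[default_label], label_mapping[pad_label]}
    entities = []
    start = 0
    for idx in range(1, len(char_predictions) + 1):
        run_ended = (
            idx == len(char_predictions)
            or char_predictions[idx] != char_predictions[start]
        )
        if run_ended:
            # Default and pad runs are not reported as entities.
            if char_predictions[start] not in skipped:
                entities.append((start, idx, reverse[char_predictions[start]]))
            start = idx
    return entities


mapping = {"PAD": 0, "UNKNOWN": 1, "NAME": 2, "CITY": 3}
spans = to_ner_format([1, 1, 2, 2, 2, 1, 3, 3], mapping, "UNKNOWN", "PAD")
```

With an exclusive end index, text[start:end] recovers the entity's characters directly.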
-
static
match_sentence_lengths(data, results, flatten_separator, inplace=True)¶ Converts the results from the model into the same ragged data shapes as the original data.
- Parameters
data (numpy.ndarray) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
flatten_separator (str) – string which joins two samples together when flattening
inplace (bool) – flag to modify results in place
- Returns
dict(pred=…) or dict(pred=…, conf=…)
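Restoring the ragged shape amounts to walking the flattened predictions and skipping the separator between samples. The sketch below assumes a single flattened prediction list in which each sample is followed by exactly one flatten_separator; match_lengths is a hypothetical helper, not the library method.

```python
def match_lengths(samples, flat_predictions, flatten_separator):
    """Slice flattened per-character predictions back into one list per sample."""
    matched = []
    position = 0
    for sample in samples:
        matched.append(flat_predictions[position:position + len(sample)])
        # Skip the separator characters that were inserted when flattening.
        position += len(sample) + len(flatten_separator)
    return matched


# "ab" and "cde" flattened with a one-character separator: "ab cde"
per_sample = match_lengths(["ab", "cde"], [1, 1, 0, 2, 2, 2], " ")
```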
-
process(data, results, label_mapping)¶ Conducts the processing on the data given the predictions, label_mapping, and default_label.
- Parameters
data (np.ndarray) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
label_mapping (dict) – labels and corresponding integers
- Returns
dict of predictions and if they exist, confidences
-
classmethod
get_class(class_name)¶
-
get_parameters(param_list=None)¶ Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
-
classmethod
load_from_disk(dirpath)¶ Loads a data processor from a given path on disk
-
classmethod
load_from_library(name)¶ Loads a data processor from within the library
-
processor_type= 'postprocessor'¶
-
save_to_disk(dirpath)¶ Saves a data processor to a path on disk.
-
set_params(**kwargs)¶ Given kwargs, set the parameters if they exist.
-
class dataprofiler.labelers.data_processing.StructCharPreprocessor(max_length=3400, default_label='UNKNOWN', pad_label='PAD', flatten_separator='\x01\x01\x01\x01\x01', is_separate_at_max_len=False)
Bases: dataprofiler.labelers.data_processing.CharPreprocessor
Initialize the StructCharPreprocessor class.
- Parameters
max_length (int) – Maximum char length in a sample.
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator
-
classmethod
help() Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
-
get_parameters(param_list=None)¶ Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
-
convert_to_unstructured_format(data, labels)¶ Converts the list of data samples into the CharPreprocessor required input data format.
- Parameters
data (numpy.ndarray) – list of strings
labels (list) – labels for each input character
- Returns
data in the following format:
text="<SAMPLE><SEPARATOR><SAMPLE>…",
entities=[(start=<INT>, end=<INT>, label="<LABEL>"), …(num_samples in data)]
-
process(data, labels=None, label_mapping=None, batch_size=32)¶ Process structured data for being processed by the CharacterLevelCnnModel.
- Parameters
data (numpy.ndarray) – List of strings to create embeddings for
labels (numpy.ndarray) – labels for each input character
label_mapping (Union[dict, None]) – maps labels to their encoded integers
batch_size (int) – Number of samples in the batch of data
- Returns
batch_data: a dict containing samples of size batch_size
- Return type
dict
-
classmethod
get_class(class_name)¶
-
classmethod
load_from_disk(dirpath)¶ Loads a data processor from a given path on disk
-
classmethod
load_from_library(name)¶ Loads a data processor from within the library
-
processor_type= 'preprocessor'¶
-
save_to_disk(dirpath)¶ Saves a data processor to a path on disk.
-
set_params(**kwargs)¶ Given kwargs, set the parameters if they exist.
-
class dataprofiler.labelers.data_processing.StructCharPostprocessor(default_label='UNKNOWN', pad_label='PAD', flatten_separator='\x01\x01\x01\x01\x01', random_state=None)
Bases: dataprofiler.labelers.data_processing.BaseDataPostprocessor
Initialize the StructCharPostprocessor class.
- Parameters
default_label (str) – Key for label_mapping that is the default label
pad_label (str) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.
-
classmethod
help()¶ Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
-
static
match_sentence_lengths(data, results, flatten_separator, inplace=True)¶ Converts the results from the model into the same ragged data shapes as the original data.
- Parameters
data (np.ndarray) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
flatten_separator (str) – string which joins two samples together when flattening
inplace (bool) – flag to modify results in place
- Returns
dict(pred=…) or dict(pred=…, conf=…)
-
convert_to_structured_analysis(sentences, results, label_mapping, default_label, pad_label) Converts unstructured results to a structured column analysis, assuming the column was flattened into a single sample. This takes the mode of all character predictions except for the separator labels. In case of a tie, any label other than background is preferred; if more than one such label remains, one is chosen at random.
- Parameters
sentences (list(str)) – samples which were predicted upon
results (dict) – character predictions for each sample return from model
label_mapping (dict) – maps labels to their encoded integers
default_label (str) – Key for label_mapping that is the default label
pad_label (str) – Key for label_mapping that is the pad label
- Returns
prediction value for a single column
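The mode-with-tie-breaking rule described above can be sketched with the standard library. The helper column_label and its arguments are hypothetical; in particular, passing the separator/pad encodings through an ignore collection is an assumption made to keep the example small.

```python
import random
from collections import Counter


def column_label(char_predictions, background, ignore=(), random_state=None):
    """Pick one label for a sample from its per-character predictions."""
    rng = random_state if random_state is not None else random.Random()
    counts = Counter(p for p in char_predictions if p not in ignore)
    if not counts:
        return background
    top = max(counts.values())
    tied = [label for label, count in counts.items() if count == top]
    # On a tie, prefer anything other than the background label.
    non_background = [label for label in tied if label != background]
    # If several candidates remain, choose one at random.
    return rng.choice(non_background or tied)


# Labels 1 (background) and 2 tie with two votes each; 0 is ignored.
tie_with_background = column_label([1, 1, 2, 2, 0], background=1, ignore={0})
# Labels 2 and 3 tie; a seeded random.Random makes the choice reproducible.
tie_between_entities = column_label(
    [2, 3, 2, 3], background=1, random_state=random.Random(0)
)
```

Passing a seeded random.Random, as the random_state parameter suggests, makes tie-breaking repeatable across runs.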
-
process(data, results, label_mapping)¶ Postprocessing of CharacterLevelCnnModel results when given structured data processed by StructCharPreprocessor.
- Parameters
data (Union[numpy.ndarray, pandas.DataFrame]) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
label_mapping (dict) – maps labels to their encoded integers
- Returns
dict of predictions and if they exist, confidences
- Return type
dict
-
classmethod
get_class(class_name)¶
-
get_parameters(param_list=None)¶ Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
-
classmethod
load_from_disk(dirpath)¶ Loads a data processor from a given path on disk
-
classmethod
load_from_library(name)¶ Loads a data processor from within the library
-
processor_type= 'postprocessor'¶
-
save_to_disk(dirpath)¶ Saves a data processor to a path on disk.
-
set_params(**kwargs)¶ Given kwargs, set the parameters if they exist.
-
class dataprofiler.labelers.data_processing.RegexPostProcessor(aggregation_func='split', priority_order=None, random_state=None)
Bases: dataprofiler.labelers.data_processing.BaseDataPostprocessor
Initialize the RegexPostProcessor class.
- Parameters
aggregation_func (str) – aggregation function to apply to regex model output (split, random, priority)
priority_order (Union[list, numpy.ndarray]) – if priority is set as the aggregation function, the order in which entities are given priority must be set
random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.
-
classmethod
help()¶ Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
-
static
priority_prediction(results, entity_priority_order)¶ Aggregation function using priority of regex to give entity determination.
- Parameters
results (dict) – regex from model in format: dict(pred=…, conf=…)
entity_priority_order (np.ndarray) – list of entity priorities (the lowest value has the highest priority)
- Returns
aggregated predictions
-
static
split_prediction(results) Splits the prediction across votes.
- Parameters
results (dict) – regex from model in format: dict(pred=…, conf=…)
- Returns
aggregated predictions
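The split and priority aggregation functions can be illustrated on a single vote vector, where each position holds 1 if the regex for that entity matched and 0 otherwise. This is a sketch under assumptions: split_votes and priority_vote are hypothetical helpers, and both the vote-vector layout and the per-entity priority list are assumed rather than taken from the library.

```python
def split_votes(votes):
    """Share one unit of confidence equally among the entities that matched."""
    total = sum(votes)
    if total == 0:
        return [0.0] * len(votes)
    return [vote / total for vote in votes]


def priority_vote(votes, entity_priority_order):
    """Return the index of the matched entity with the lowest priority value."""
    matched = [idx for idx, vote in enumerate(votes) if vote]
    if not matched:
        return None
    return min(matched, key=lambda idx: entity_priority_order[idx])


# Entities 0 and 2 matched, so each receives half of the confidence.
shared = split_votes([1, 0, 1, 0])
# Entities 1 and 2 matched; entity 2 has priority 0, the lowest value.
winner = priority_vote([0, 1, 1], [2, 1, 0])
```

Splitting preserves every match as a partial confidence, while the priority rule commits to a single entity per position.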
-
process(data, labels=None, label_mapping=None, batch_size=None) Data postprocessing function.
-
classmethod
get_class(class_name)¶
-
get_parameters(param_list=None)¶ Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
-
classmethod
load_from_disk(dirpath)¶ Loads a data processor from a given path on disk
-
classmethod
load_from_library(name)¶ Loads a data processor from within the library
-
processor_type= 'postprocessor'¶
-
save_to_disk(dirpath)¶ Saves a data processor to a path on disk.
-
set_params(**kwargs)¶ Given kwargs, set the parameters if they exist.