mcetl.data_source

The DataSource class contains all needed information for importing, processing, and saving data.

@author: Donald Erb

Created on Jul 31, 2020

Module Contents

Classes

DataSource

Used to give default settings for importing data and various functions based on the source.

class mcetl.data_source.DataSource(name, *, functions=None, column_labels=None, column_numbers=None, start_row=0, end_row=0, separator=None, file_type=None, num_files=1, unique_variables=None, unique_variable_indices=None, xy_plot_indices=None, figure_rcparams=None, excel_writer_styles=None, excel_row_offset=0, excel_column_offset=0, entry_separation=0, sample_separation=0, label_entries=True)

Used to give default settings for importing data and various functions based on the source.

Parameters
  • name (str) -- The name of the DataSource. Used when displaying the DataSource in a GUI.

  • functions (list or tuple, optional) -- A list or tuple of Function objects (CalculationFunction, PreprocessFunction, or SummaryFunction) that will be used to process data for the DataSource. Functions are performed in the following order: PreprocessFunctions, CalculationFunctions, SummaryFunctions. Functions of the same type are performed in the order in which they were input.

  • column_labels (tuple(str) or list(str), optional) -- A list/tuple of strings that will be used to label columns in the Excel output, and to label the pandas DataFrame columns for the data.

  • column_numbers (tuple(int) or list(int), optional) -- The indices of the columns to import from raw data files.

  • start_row (int, optional) -- The first row of data to use when importing from raw data files.

  • end_row (int, optional) -- The last row of data to use when importing from raw data files. Counts up from the last row, so the last row is 0, the second to last row is 1, etc.

  • separator (str, optional) -- The separator or delimiter used to separate data columns when importing from raw data files. For example, ',' for csv files.

  • file_type (str, optional) -- The file extension associated with the data files for the DataSource. For example, 'txt' or 'csv'.

  • num_files (int, optional) -- The number of data files per sample for the DataSource. Only used when using keyword search for files.

  • unique_variables (list(str) or tuple(str), optional) -- The names of all columns from the imported raw data that are needed for calculations. For example, if importing thermogravimetric analysis (TGA) data, the unique_variables could be ['temperature', 'mass'].

  • unique_variable_indices (list(int) or tuple(int), optional) -- The indices of the columns within column_numbers that correspond with each of the input unique_variables.

  • xy_plot_indices (list(int, int) or tuple(int, int), optional) -- The indices of the columns after processing that will be the default columns for plotting in Excel.

  • figure_rcparams (dict, optional) -- A dictionary containing any changes to Matplotlib's rcParams to use if fitting or plotting.

  • excel_writer_styles (dict(str, None or dict or str or openpyxl.styles.named_styles.NamedStyle), optional) --

    A dictionary of styles used to format the output Excel workbook. The following keys are used when writing data from files to Excel:

    'header_even', 'header_odd', 'subheader_even', 'subheader_odd', 'columns_even', 'columns_odd'

    The following keys are used when writing data fit results to Excel:

    'fitting_header_even', 'fitting_header_odd', 'fitting_subheader_even', 'fitting_subheader_odd', 'fitting_columns_even', 'fitting_columns_odd', 'fitting_descriptors_even', 'fitting_descriptors_odd'

    The values in the dictionary must be either dictionaries, with keys corresponding to keyword inputs for openpyxl's NamedStyle, or NamedStyle objects.

  • excel_row_offset (int, optional) -- The first row to use when writing to Excel. A value of 0 would start at row 1 in Excel, 1 would start at row 2, etc.

  • excel_column_offset (int, optional) -- The first column to use when writing to Excel. A value of 0 would start at column 'A' in Excel, 1 would start at column 'B', etc.

  • entry_separation (int, optional) -- The number of blank columns to insert between data entries when writing to Excel.

  • sample_separation (int, optional) -- The number of blank columns to insert between samples when writing to Excel.

  • label_entries (bool, optional) -- If True, will add a number to the column labels for each entry in a sample if there is more than one entry. For example, the column label 'data' would become 'data, 1', 'data, 2', etc.
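
Several of the parameters above use index or offset conventions that are easy to misread. The following is a minimal sketch of how those conventions can be interpreted, based only on the descriptions above. It does not use mcetl, the helper names (trim_rows, map_unique_variables, excel_cell, label_entry_columns) are hypothetical, and it assumes start_row is zero-based.

```python
def trim_rows(rows, start_row=0, end_row=0):
    """Keep rows from start_row through the row that is end_row from the end.

    Assumes start_row is zero-based; end_row counts up from the last row,
    so end_row=0 keeps the last row and end_row=1 drops it.
    """
    return rows[start_row:len(rows) - end_row]


def map_unique_variables(column_numbers, unique_variables, unique_variable_indices):
    """Pair each unique variable with the raw-file column it refers to.

    Each index points into column_numbers, not into the raw file itself.
    """
    return {
        variable: column_numbers[index]
        for variable, index in zip(unique_variables, unique_variable_indices)
    }


def excel_cell(excel_row_offset=0, excel_column_offset=0):
    """Return the first Excel cell used for the given zero-based offsets."""
    letters = ''
    column = excel_column_offset + 1  # Excel columns are one-based
    while column > 0:
        column, remainder = divmod(column - 1, 26)
        letters = chr(ord('A') + remainder) + letters
    return f'{letters}{excel_row_offset + 1}'


def label_entry_columns(label, num_entries, label_entries=True):
    """Number a column label once per entry when a sample has several entries."""
    if label_entries and num_entries > 1:
        return [f'{label}, {number}' for number in range(1, num_entries + 1)]
    return [label] * num_entries


# Illustrative inputs only; none of these values come from mcetl.
raw_rows = ['header', 'row a', 'row b', 'row c', 'footer']
kept_rows = trim_rows(raw_rows, start_row=1, end_row=1)
variable_columns = map_unique_variables(
    [0, 2, 3], ['temperature', 'mass'], [0, 2]
)
first_cell = excel_cell(excel_row_offset=1, excel_column_offset=1)
labels = label_entry_columns('data', 2)
```

In this sketch, 'mass' maps to raw column 3 because its index, 2, selects the third item of column_numbers.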

excel_styles

A nested dictionary of dictionaries, used to create openpyxl NamedStyle objects to format the output Excel file.

Type

dict(dict)

lengths

A list of lists of lists of integers, corresponding to the number of columns in each individual entry in the total dataframes for the DataSource. Used to split the concatenated dataframe back into individual dataframes for each dataset.

Type

list

references

A list of dictionaries, with each dictionary containing the column numbers for each unique variable and calculation for the merged dataframe of each dataset.

Type

list

Raises
  • ValueError -- Raised if the input name is a blank string, or if either excel_row_offset or excel_column_offset is less than 0.

  • TypeError -- Raised if one of the input functions is not a valid mcetl.FunctionBase object.

  • IndexError -- Raised if the number of data columns is less than the number of unique variables.
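
The documented error conditions can be mirrored in a short sketch. This is not mcetl's validation code: check_inputs and error_name are hypothetical helpers, the FunctionBase class is a stand-in so the example runs without mcetl, and the sketch assumes that "the number of data columns" means the length of column_numbers.

```python
class FunctionBase:
    """Stand-in for mcetl.FunctionBase (assumption, so the sketch is self-contained)."""


def check_inputs(name, functions=(), column_numbers=(), unique_variables=(),
                 excel_row_offset=0, excel_column_offset=0):
    """Raise the documented exception types for the documented conditions."""
    if not name:
        raise ValueError('name cannot be a blank string')
    if excel_row_offset < 0 or excel_column_offset < 0:
        raise ValueError('excel_row_offset and excel_column_offset must be >= 0')
    for function in functions:
        if not isinstance(function, FunctionBase):
            raise TypeError(f'{function!r} is not a FunctionBase object')
    if len(column_numbers) < len(unique_variables):
        raise IndexError('fewer data columns than unique variables')


def error_name(**kwargs):
    """Return the name of the exception raised by check_inputs, or None."""
    try:
        check_inputs(**kwargs)
    except (ValueError, TypeError, IndexError) as error:
        return type(error).__name__
    return None


# Two unique variables but only one imported column.
too_few_columns = error_name(
    name='TGA', column_numbers=[0], unique_variables=['temperature', 'mass']
)
```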

merge_datasets(self, dataframes)

Merges all entries and samples into one dataframe for each dataset.

Also sets the lengths attribute, which will later be used to separate each dataframe back into individual dataframes for each entry.

Parameters

dataframes (list(list(list(pd.DataFrame)))) -- A list of lists of lists of dataframes, corresponding to entries and samples within each dataset.

Returns

merged_dataframes -- A list of dataframes.

Return type

list(pd.DataFrame)

print_column_labels_template(self)

Convenience function that will print a template for all the column headers.

Column headers account for all of the columns imported from raw data, the columns added by CalculationFunctions, and the columns added by SummaryFunctions.

Returns

label_template -- The list of strings that serves as a template for the necessary input for column_labels for the DataSource.

Return type

list(str)

split_into_entries(self, merged_dataframes)

Splits the merged dataset dataframes back into dataframes for each entry.

Parameters

merged_dataframes (list(pd.DataFrame)) -- A list of dataframes. Each dataframe will be split into lists of lists of dataframes.

Returns

split_dataframes -- A list of lists of lists of dataframes, corresponding to entries and samples within each dataset.

Return type

list(list(list(pd.DataFrame)))
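
The merge and split steps depend on recording how many columns each entry contributed. The sketch below illustrates that bookkeeping without pandas or mcetl: each "dataframe" is represented by a plain list of column labels, and merge_sketch and split_sketch are hypothetical stand-ins, not the actual methods.

```python
def merge_sketch(datasets):
    """Flatten every entry's columns into one list per dataset.

    Returns the merged columns and a nested list of column counts that
    plays the role of the lengths attribute.
    """
    merged, lengths = [], []
    for dataset in datasets:
        columns, dataset_lengths = [], []
        for sample in dataset:
            dataset_lengths.append([len(entry) for entry in sample])
            for entry in sample:
                columns.extend(entry)
        merged.append(columns)
        lengths.append(dataset_lengths)
    return merged, lengths


def split_sketch(merged, lengths):
    """Use the recorded column counts to rebuild the nested structure."""
    split = []
    for columns, dataset_lengths in zip(merged, lengths):
        position = 0
        dataset = []
        for sample_lengths in dataset_lengths:
            sample = []
            for count in sample_lengths:
                sample.append(columns[position:position + count])
                position += count
            dataset.append(sample)
        split.append(dataset)
    return split


# One dataset with two samples; the first sample has two entries.
nested = [[
    [['temperature', 'mass'], ['temperature', 'mass', 'derivative']],
    [['temperature', 'mass']],
]]
merged, lengths = merge_sketch(nested)
restored = split_sketch(merged, lengths)
```

Because the column counts are stored per dataset, per sample, and per entry, the split can restore the original nesting even when entries have different numbers of columns.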

static test_excel_styles(styles)

Tests whether the input styles create valid Excel styles with openpyxl.

Parameters

styles (dict(str, None or dict or str or openpyxl.styles.named_styles.NamedStyle)) -- The dictionary of styles to test. Values in the dictionary can either be None, a nested dictionary with the necessary keys and values to create an openpyxl NamedStyle, a string (which would refer to another NamedStyle.name), or openpyxl.styles.NamedStyle objects.

Returns

Returns True if all input styles successfully create openpyxl NamedStyle objects; otherwise, returns False.

Return type

bool

Notes

This is a wrapper around ExcelWriterHandler.test_excel_styles(). It is included because DataSource is the main user-facing object of mcetl and will be used more often.
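
Before testing a styles dictionary, it can help to catch misspelled key names. The sketch below checks a dictionary against the keys listed for excel_writer_styles. The helper unknown_style_keys is hypothetical and is not part of mcetl, and whether mcetl itself rejects unrecognized keys is not stated here. The nested dictionary uses number_format, which is one of the keyword inputs of openpyxl's NamedStyle.

```python
# Keys documented for writing data from files to Excel.
FILE_STYLE_KEYS = {
    'header_even', 'header_odd', 'subheader_even', 'subheader_odd',
    'columns_even', 'columns_odd',
}
# Keys documented for writing data fit results to Excel.
FITTING_STYLE_KEYS = {
    'fitting_header_even', 'fitting_header_odd',
    'fitting_subheader_even', 'fitting_subheader_odd',
    'fitting_columns_even', 'fitting_columns_odd',
    'fitting_descriptors_even', 'fitting_descriptors_odd',
}


def unknown_style_keys(styles):
    """Return, in sorted order, any keys that are not documented style keys."""
    return sorted(set(styles) - (FILE_STYLE_KEYS | FITTING_STYLE_KEYS))


# Illustrative styles only; 'colums_odd' is a deliberate typo.
styles = {
    'header_even': {'number_format': '0.00'},
    'columns_even': None,
    'colums_odd': {'number_format': '0.00'},
}
typos = unknown_style_keys(styles)
```

With mcetl installed, a dictionary that passes this check could then be given to DataSource.test_excel_styles(styles), which reports whether openpyxl can build the styles.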