Creating Function Objects

Function objects are simple wrappers around functions that allow referencing the function by name, specifying which columns the function should act on, and specifying other relevant information about the function's behavior. Function objects come in three types:

  • PreprocessFunctions: process each data entry as soon as it is imported from the files

  • CalculationFunctions: process each entry in each sample in each dataset

  • SummaryFunctions: process each sample or each dataset; performed last

Note that all functions will process pandas DataFrames. To speed up calculations within the functions, it is suggested to use vectorized operations on the pandas DataFrame or Series, or on the underlying numpy arrays, since vectorization can be significantly faster than iterating over each index with a for loop, as illustrated below.
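
A simple illustration (not specific to mcetl) of the same calculation done three ways; the vectorized versions avoid the slow Python-level loop:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(10000), 'y': np.random.rand(10000)})

# slow: iterating over each index with a for loop
doubled = []
for i in range(len(df)):
    doubled.append(df['y'][i] * 2)

# fast: vectorized operation on the entire Series at once
doubled = df['y'] * 2

# also fast: vectorized operation on the underlying numpy array
doubled = df['y'].to_numpy() * 2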

This tutorial will cover some basic examples for the three Function types. To see more advanced examples, see the example programs in the GitHub repository. The use_main_gui.py example program shows many different uses of the three Function objects, and the creating_functions.py example program walks through the internals of how launch_main_gui() calls the Functions so that each individual step can be understood.

Note

For functions of all three Function types, it is a good idea to add **kwargs to the function input. In later versions of mcetl, additional items may be passed to functions, but they will always be added as keyword arguments (i.e. passed as name=value). By adding **kwargs, any unwanted keyword arguments will be ignored and will not cause issues when upgrading mcetl. For example:
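
A minimal sketch; new_option is a made-up keyword used only to show that unexpected arguments are safely absorbed:

def my_function(df, target_indices, **kwargs):
    # if a later version of mcetl were to call, for example,
    # my_function(df, target_indices, new_option=1), the unused
    # new_option keyword would be collected into kwargs and ignored
    return [df]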

PreprocessFunctions

A PreprocessFunction will perform its function on each data entry individually. Example uses of a PreprocessFunction include splitting a single data entry into multiple entries, collecting information on all data entries in each sample in each dataset for later use, or performing simple processes that are easier to do per entry rather than after each dataset is grouped together.

The function for a PreprocessFunction must take at least two arguments: the dataframe containing the data for the entry, and a list of indices indicating which columns of the dataframe contain the data used by the function. The function must return a list of dataframes after processing, even if only one dataframe is used within the function.

A simple function that splits dataframes based on the segment number and then removes the segment number column is shown below.

import mcetl
import numpy as np

def split_segments(df, target_indices, **kwargs):
    """
    Preprocess function that separates each entry based on the segment number.

    Also removes the segment column after processing since it is not needed
    in the final output.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe of the entry.
    target_indices : list(int)
        The indices of the target columns.

    Returns
    -------
    output_dataframes : list(pandas.DataFrame)
        The list of dataframes after splitting by segment number.

    """

    segment_index = target_indices[0]
    segment_col = df[df.columns[segment_index]].to_numpy()
    mask = np.where(segment_col[:-1] != segment_col[1:])[0] + 1 # + 1 since mask loses one index

    output_dataframes = np.array_split(df, mask)

    for dataframe in output_dataframes:
        # drop the segment column by label; the positional axis argument
        # used by older pandas versions was removed in pandas 2.0
        dataframe.drop(columns=dataframe.columns[segment_index], inplace=True)

    return output_dataframes

# targets the 'segment' column from the imported data
segment_separator = mcetl.PreprocessFunction(
    name='segment_separator', target_columns='segment',
    function=split_segments, deleted_columns=['segment']
)
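
To illustrate the splitting logic by itself, split_segments can be called directly on a small dataframe; integer column labels are used to mimic mcetl's internal column numbering:

import pandas as pd

# one entry whose second column (index 1) is the segment number
data = pd.DataFrame({0: [1.0, 2.0, 3.0, 4.0], 1: [0, 0, 1, 1]})
segments = split_segments(data, [1])
# segments is a list of two dataframes (rows 0-1 and rows 2-3),
# each with the segment column removed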

In addition, PreprocessFunctions can be used for procedures that are easier to perform on each data entry separately rather than after all of the data is collected together, such as sorting each entry based on the values in one of its columns.

def sort_columns(df, target_indices, **kwargs):
    """Sorts the dataframe based on the values of the target column."""
    return [df.sort_values(target_indices[0])]

# targets the 'diameter' column from the imported data
pore_preprocess = mcetl.PreprocessFunction('sort_data', 'diameter', sort_columns)

CalculationFunctions

A CalculationFunction will perform its function on each merged dataset. Each merged dataset is composed of all data entries for each sample concatenated together, resembling how the dataset will look when written to an Excel sheet. This makes the functions more difficult to create, since target columns are given as nested lists of lists (one list of column indices per sample), but it allows access to all data in a dataset if required for complex calculations.

Each CalculationFunction can have two functions: one for performing calculations on data destined for Excel and one for performing calculations on data used within Python. This way, the Excel function can create strings matching Excel formulas, like '=SUM(A1:A3)', to make the data dynamic within the Excel workbook, while the Python function can calculate the actual numerical data (e.g. sum(data[0:3]) to match the previous Excel formula). If only a single numerical calculation is desired, regardless of whether the data is being output to Excel or Python, then only a single function needs to be specified (an example of such a function is given below).

The functions for a CalculationFunction must take at least three arguments:

  • the dataframe containing the data for the dataset

  • a list of lists of integers that tell which columns in the dataframe contain the data used by the function (the target columns)

  • a list of lists of integers that tell which columns in the dataframe are used for the output of the function (the calculation columns)

Additionally, two keyword arguments are passed to the functions:

  • excel_columns: a list of strings corresponding to the columns in Excel used by the dataset (e.g. ['A', 'B', 'C', 'D']) when doing Excel functions, and None when doing Python functions

  • first_row: an integer telling the first row in Excel that the data will begin on (3 by default, since the first row contains the sample names and the second contains the column names)

The functions must return a DataFrame after processing, even if all changes to the input DataFrame were done in place.

A simple function that adds a column of offset data is shown below.

import mcetl
import numpy as np

def offset_data_excel(df, target_indices, calc_indices, excel_columns,
                      first_row, offset, **kwargs):
    """Creates a string that will add an offset to data in Excel."""

    total_count = 0
    for i, sample in enumerate(calc_indices):
        for j, calc_col in enumerate(sample):
            y = df[target_indices[0][i][j]]
            y_col = excel_columns[target_indices[0][i][j]]
            calc = [
                f'= {y_col}{k + first_row} + {offset * total_count}' for k in range(len(y))
            ]
            # use np.where(~np.isnan(y)) so that the calculation works for unequally-sized
            # datasets
            df[calc_col] = np.where(~np.isnan(y), calc, None)
            total_count += 1

    return df

def offset_data_python(df, target_indices, calc_indices, first_row, **kwargs):
    """Adds an offset to data."""

    total_count = 0
    for i, sample in enumerate(calc_indices):
        for j, calc_col in enumerate(sample):
            y = df[target_indices[0][i][j]]
            df[calc_col] = y + (kwargs['offset'] * total_count)
            total_count += 1

    return df

# targets the 'data' column from the imported data
offset = mcetl.CalculationFunction(
    name='offset', target_columns='data', functions=(offset_data_excel, offset_data_python),
    added_columns=1, function_kwargs={'offset': 10}
)

Alternatively, the two functions could be combined into one, with the calculation route decided by examining the value of the excel_columns input, which is a list of strings when processing for Excel and None when processing for Python.

import mcetl
import numpy as np

def offset_data(df, target_indices, calc_indices, excel_columns,
                first_row, offset, **kwargs):
    """Adds an offset to data."""

    total_count = 0
    for i, sample in enumerate(calc_indices):
        for j, calc_col in enumerate(sample):
            if excel_columns is not None: # do Excel calculations
                y = df[target_indices[0][i][j]]
                y_col = excel_columns[target_indices[0][i][j]]
                calc = [
                    f'= {y_col}{k + first_row} + {offset * total_count}' for k in range(len(y))
                ]
                df[calc_col] = np.where(~np.isnan(y), calc, None)

            else: # do Python calculations
                y = df[target_indices[0][i][j]]
                df[calc_col] = y + (offset * total_count)
            total_count += 1

    return df

# targets the 'data' column from the imported data
offset = mcetl.CalculationFunction(
    name='offset', target_columns='data', functions=offset_data,
    added_columns=1, function_kwargs={'offset': 10}
)
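
To show what the Excel route produces, offset_data can be called directly with inputs shaped the way mcetl passes them (nested lists with one list of column indices per sample; the values here are only for illustration):

import numpy as np
import pandas as pd

data = pd.DataFrame({0: [1.0, 2.0, 3.0], 1: [np.nan] * 3})
output = offset_data(
    data, target_indices=[[[0]]], calc_indices=[[1]],
    excel_columns=['A', 'B'], first_row=3, offset=10
)
# column 1 now contains the Excel formula strings:
# '= A3 + 0', '= A4 + 0', '= A5 + 0'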

To modify the contents of an existing column rather than adding a new one, the input for added_columns for a CalculationFunction should be a string designating the target: either a variable from the imported data, or the name of another CalculationFunction.

import mcetl
import numpy as np

def normalize(df, target_indices, calc_indices, excel_columns, first_row, **kwargs):
    """Performs a min-max normalization to bound values between 0 and 1."""

    for i, sample in enumerate(calc_indices):
        for j, calc_col in enumerate(sample):
            if excel_columns is not None:
                y = df[target_indices[0][i][j]]
                y_col = excel_columns[target_indices[0][i][j]]
                # use first_row rather than hard-coding row 3 so the formulas
                # stay correct if the data starts on a different row
                end = y.count() + first_row - 1
                data_range = f'{y_col}${first_row}:{y_col}${end}'
                calc = [
                    (f'=({y_col}{k + first_row} - MIN({data_range})) / '
                     f'(MAX({data_range}) - MIN({data_range}))')
                    for k in range(len(y))
                ]

                df[calc_col] = np.where(~np.isnan(y), calc, None)

            else:
                y_col = df.columns[target_indices[0][i][j]]
                min_y = df[y_col].min()
                max_y = df[y_col].max()

                df[calc_col] = (df[y_col] - min_y) / (max_y - min_y)

    return df

def offset_normalized(df, target_indices, calc_indices, excel_columns,
                      offset, **kwargs):
    """Adds an offset to normalized data."""

    total_count = 0
    for i, sample in enumerate(calc_indices):
        for j, calc_col in enumerate(sample):
            y = df[target_indices[0][i][j]]
            offset_amount = offset * total_count
            if excel_columns is not None:
                # append the offset to the existing Excel formula strings
                df[calc_col] = y + f' + {offset_amount}'
            else:
                df[calc_col] = y + offset_amount
            total_count += 1

    return df

# targets the 'data' column from the imported data
normalize_func = mcetl.CalculationFunction(
    name='normalize', target_columns='data',
    functions=normalize, added_columns=1
)
# targets the 'normalize' column from the 'normalize' CalculationFunction
# and also alters its contents
offset_func = mcetl.CalculationFunction(
    name='offset', target_columns='normalize', functions=offset_normalized,
    added_columns='normalize', function_kwargs={'offset': 10}
)

If the CalculationFunction does the same calculation regardless of whether the data is going to Excel or to later processing in Python, then a mutable object, such as a list, can be used in function_kwargs to signal that the calculation has already been performed, which prevents processing the data twice.

import mcetl

def offset_numerical(df, target_indices, calc_indices, excel_columns, **kwargs):
    """Adds a numerical offset to data."""

    # Add this section to prevent doing numerical calculations twice.
    if excel_columns is None and kwargs['processed'][0]:
        return df # return to prevent processing twice
    elif excel_columns is not None:
        kwargs['processed'][0] = True

    # Regular calculation section
    offset = kwargs['offset']
    total_count = 0
    for i, sample in enumerate(calc_indices):
        for j, calc_col in enumerate(sample):
            df[calc_col] = df[target_indices[0][i][j]] + (offset * total_count)
            total_count += 1

    return df

# targets the 'data' column from the imported data
numerical_offset = mcetl.CalculationFunction(
    name='numerical offset', target_columns='data', functions=offset_numerical,
    added_columns=1, function_kwargs={'offset': 10, 'processed': [False]}
)
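
To see the flag in action, offset_numerical can be called twice in the order mcetl uses (the Excel pass first, then the Python pass); these direct calls are only for illustration:

import numpy as np
import pandas as pd

data = pd.DataFrame({0: [1.0, 2.0], 1: [np.nan] * 2})
flag_kwargs = {'offset': 10, 'processed': [False]}

# Excel pass: performs the calculation and sets processed[0] to True
offset_numerical(data, [[[0]]], [[1]], excel_columns=['A', 'B'], **flag_kwargs)
# Python pass: processed[0] is already True, so the dataframe is
# returned without repeating the calculation
offset_numerical(data, [[[0]]], [[1]], excel_columns=None, **flag_kwargs)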

SummaryFunctions

A SummaryFunction is very similar to a CalculationFunction, performing its functions on each merged dataset and likewise returning a single DataFrame. However, SummaryFunctions differ from CalculationFunctions in that their added columns are not within the data entries themselves. Instead, a SummaryFunction can either be a sample SummaryFunction (created with sample_summary=True), which is equivalent to appending a data entry to each sample in each dataset, or a dataset SummaryFunction (created with sample_summary=False), which is equivalent to appending a sample to each dataset.

For example, consider calculating the elastic modulus from tensile tests. Each sample in the dataset will have multiple measurements/entries, so a sample SummaryFunction could be used to calculate the average elastic modulus for each sample, and a dataset SummaryFunction could be used to create a table listing the average elastic modulus of each sample in the dataset for easy reference.

import mcetl
import numpy as np
import pandas as pd
from scipy import optimize

def stress_model(strain, modulus):
    """
    The linear estimate of the stress-strain curve using the strain and estimated modulus.

    Parameters
    ----------
    strain : array-like
        The array of experimental strain values, unitless (with cancelled
        units, such as mm/mm).
    modulus : float
The estimated elastic modulus for the data, with units of GPa (1 GPa = 10**9 Pa).

    Returns
    -------
    array-like
        The estimated stress data following the linear model, with units of Pa.

    """
    return strain * modulus * 1e9


def tensile_calculation(df, target_indices, calc_indices, excel_columns, **kwargs):
    """Calculates the elastic modulus from the stress-strain curve for each entry."""

    if excel_columns is None and kwargs['processed'][0]:
        return df # return to prevent processing twice
    elif excel_columns is not None:
        kwargs['processed'][0] = True

    num_columns = 2 # the number of calculation columns per entry
    for i, sample in enumerate(calc_indices):
        for j in range(len(sample) // num_columns):
            strain_index = target_indices[0][i][j]
            stress_index = target_indices[1][i][j]
            nan_mask = (~np.isnan(df[strain_index])) & (~np.isnan(df[stress_index]))

            # to convert strain from % to unitless
            strain = df[strain_index].to_numpy()[nan_mask] / 100
            # to convert stress from MPa to Pa
            stress = df[stress_index].to_numpy()[nan_mask] * 1e6

            # only use data where stress varies linearly with respect to strain
            linear_mask = (
                (strain >= kwargs['lower_limit']) & (strain <= kwargs['upper_limit'])
            )
            initial_guess = 80 # initial guess of the elastic modulus, in GPa

            modulus, covariance = optimize.curve_fit(
                stress_model, strain[linear_mask], stress[linear_mask],
                p0=[initial_guess]
            )

            df[sample[0 + (j * num_columns)]] = pd.Series(('Value', 'Standard Error'))
            df[sample[1 + (j * num_columns)]] = pd.Series(
                (modulus[0], np.sqrt(np.diag(covariance)[0]))
            )

    return df


def tensile_sample_summary(df, target_indices, calc_indices, excel_columns, **kwargs):
    """Summarizes the mechanical properties for each sample."""

    if excel_columns is None and kwargs['processed'][0]:
        return df # to prevent processing twice

    num_cols = 2 # the number of calculation columns per entry from tensile_calculation
    for i, sample in enumerate(calc_indices):
        if not sample: # skip empty lists
            continue

        entries = []
        for j in range(len(target_indices[0][i]) // num_cols):
            entries.append(target_indices[0][i][j * num_cols:(j + 1) * num_cols])

        df[sample[0]] = pd.Series(['Elastic Modulus (GPa)'])
        df[sample[1]] = pd.Series([np.mean([df[entry[1]][0] for entry in entries])])
        df[sample[2]] = pd.Series([np.std([df[entry[1]][0] for entry in entries])])

    return df


def tensile_dataset_summary(df, target_indices, calc_indices, excel_columns, **kwargs):
    """Summarizes the mechanical properties for each dataset."""

    if excel_columns is None and kwargs['processed'][0]:
        return df # to prevent processing twice

    # the number of samples is the number of lists in calc_indices - 1
    num_samples = len(calc_indices[:-1])

    # the calc index is -1 since the last sample in calc_indices is the dataset summary
    df[calc_indices[-1][0]] = pd.Series(
        [''] + [f'Sample {num + 1}' for num in range(num_samples)]
    )
    df[calc_indices[-1][1]] = pd.Series(
        ['Average'] + [df[indices[1]][0] for indices in target_indices[0][:-1]]
    )
    df[calc_indices[-1][2]] = pd.Series(
        ['Standard Deviation'] + [df[indices[2]][0] for indices in target_indices[0][:-1]]
    )

    return df

# share the keyword arguments between all function objects
tensile_kwargs = {'lower_limit': 0.0015, 'upper_limit': 0.005, 'processed': [False]}

# targets the 'strain' and 'stress' columns from the imported data
tensile_calc = mcetl.CalculationFunction(
    name='tensile calc', target_columns=['strain', 'stress'],
    functions=tensile_calculation, added_columns=2,
    function_kwargs=tensile_kwargs
)
# targets the columns from the 'tensile calc' CalculationFunction
stress_sample_summary = mcetl.SummaryFunction(
    name='tensile sample summary', target_columns=['tensile calc'],
    functions=tensile_sample_summary, added_columns=3,
    function_kwargs=tensile_kwargs, sample_summary=True
)
# targets the columns from the 'tensile sample summary' SummaryFunction
stress_dataset_summary = mcetl.SummaryFunction(
    name='tensile dataset summary', target_columns=['tensile sample summary'],
    functions=tensile_dataset_summary, added_columns=3,
    function_kwargs=tensile_kwargs, sample_summary=False
)
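
Once created, Function objects are attached to a DataSource through its functions keyword so that they are applied when processing data. The sketch below uses placeholder values; unique_variables supplies the variable names ('strain' and 'stress') that the target_columns above refer to:

import mcetl

tensile = mcetl.DataSource(
    name='Tensile Test',  # placeholder name
    functions=[tensile_calc, stress_sample_summary, stress_dataset_summary],
    unique_variables=['strain', 'stress'],
)

# the DataSource would then be used with the main GUI, for example:
# output = mcetl.launch_main_gui([tensile])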