tsfresh.utilities package¶
Submodules¶
tsfresh.utilities.dataframe_functions module¶
Utility functions for converting DataFrames to the internal normalized format (see normalize_input_to_internal_representation) and for handling NaN and inf values in the DataFrames.

tsfresh.utilities.dataframe_functions.
add_sub_time_series_index
(df_or_dict, sub_length, column_id=None, column_sort=None, column_kind=None)[source]¶ Add a column “id” which contains:
 if column_id is None: for each kind (or, if column_kind is None, for the full dataframe) a new index built by “subpackaging” the data in packages of length “sub_length”. For example, if you have data of length 11 and sub_length is 2, you will get 6 new packages: 0, 0; 1, 1; 2, 2; 3, 3; 4, 4; 5.
 if column_id is not None: the same as before, but for each id separately. The old column_id values are appended to the new “id” column after a comma.
You can use this function to split a long measurement into subpackages on which you want to extract features.
Parameters:  df_or_dict (pandas.DataFrame or dict) – a pandas DataFrame or a dictionary. The required shape/form of the object depends on the rest of the passed arguments.
 column_id (basestring or None) – it must be present in the pandas DataFrame or in all DataFrames in the dictionary. It is not allowed to have NaN values in this column.
 column_sort (basestring or None) – if not None, sort the rows by this column. It is not allowed to have NaN values in this column.
 column_kind (basestring or None) – It can only be used when passing a pandas DataFrame (the dictionary is already assumed to be grouped by the kind). It must be present in the DataFrame and no NaN values are allowed. If the kind column is not passed, it is assumed that each column in the pandas DataFrame (except the id or sort column) is a possible kind.
Returns: The data frame or dictionary of data frames with a column “id” added
Return type: the one from df_or_dict
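The “subpackaging” rule for the case column_id is None can be illustrated with a minimal pure-Python sketch. The helper name `sub_package_ids` is hypothetical and not part of tsfresh; it only reproduces the index arithmetic from the example above (length 11, sub_length 2), not the DataFrame handling.

```python
def sub_package_ids(n_rows, sub_length):
    """Return one package index per row.

    Rows 0 .. sub_length-1 get index 0, the next sub_length rows get 1, and so on.
    The last package may be shorter if n_rows is not a multiple of sub_length.
    """
    return [row // sub_length for row in range(n_rows)]


# Data of length 11 with sub_length 2, as in the description above
ids = sub_package_ids(11, 2)
print(ids)
```

With these inputs the last package holds a single row, which matches the trailing “5” in the example.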

tsfresh.utilities.dataframe_functions.
check_for_nans_in_columns
(df, columns=None)[source]¶ Helper function to check for NaN in the data frame and raise a ValueError if there is one.
Parameters:  df (pandas.DataFrame) – the pandas DataFrame to test for NaNs
 columns (list) – a list of columns to test for NaNs. If left empty, all columns of the DataFrame will be tested.
Returns: None
Raise: ValueError if NaNs are found in the DataFrame.

tsfresh.utilities.dataframe_functions.
get_ids
(df_or_dict, column_id)[source]¶ Aggregates all ids in column_id from the time series container.
Parameters:  df_or_dict (pandas.DataFrame or dict) – a pandas DataFrame or a dictionary.
 column_id (basestring) – it must be present in the pandas DataFrame or in all DataFrames in the dictionary. It is not allowed to have NaN values in this column.
Returns: a set with all existing ids in the time series container
Return type: Set
Raise: TypeError if df_or_dict is not of type dict or pandas.DataFrame

tsfresh.utilities.dataframe_functions.
get_range_values_per_column
(df)[source]¶ Retrieves the finite max, min and median values per column in the DataFrame df and stores them in three dictionaries. The dictionaries col_to_max, col_to_min and col_to_median map the column name to the maximal, minimal or median value of that column.
If a column does not contain any finite values at all, a 0 is stored instead.
Parameters: df (pandas.DataFrame) – the DataFrame to get the column-wise max, min and median from
Returns: Dictionaries mapping column names to max, min and median values
Return type: (dict, dict, dict)
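As a rough illustration of the rule described above, the following sketch computes the three dictionaries for a plain dict of lists instead of a DataFrame. The function name `range_values_per_column` and the dict-of-lists input are assumptions made to keep the example dependency-free; this is not the tsfresh implementation.

```python
import math
from statistics import median


def range_values_per_column(columns):
    """Map each column name to its finite max, min and median (0 if nothing is finite)."""
    col_to_max, col_to_min, col_to_median = {}, {}, {}
    for name, values in columns.items():
        # Ignore NaN, +inf and -inf when computing the statistics
        finite = [v for v in values if math.isfinite(v)]
        if finite:
            col_to_max[name] = max(finite)
            col_to_min[name] = min(finite)
            col_to_median[name] = median(finite)
        else:
            # A column without any finite value falls back to 0
            col_to_max[name] = col_to_min[name] = col_to_median[name] = 0
    return col_to_max, col_to_min, col_to_median


columns = {
    "a": [1.0, float("nan"), 3.0, float("inf")],
    "b": [float("nan"), float("-inf")],
}
col_to_max, col_to_min, col_to_median = range_values_per_column(columns)
print(col_to_max, col_to_min, col_to_median)
```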

tsfresh.utilities.dataframe_functions.
impute
(df_impute)[source]¶ Columnwise replaces all NaNs and infs from the DataFrame df_impute with average/extreme values from the same columns. This is done as follows: each occurring inf or NaN in df_impute is replaced by
 -inf -> min
 +inf -> max
 NaN -> median
If the column does not contain finite values at all, it is filled with zeros.
This function modifies df_impute in place. After that, df_impute is guaranteed to not contain any non-finite values. Also, all columns will be guaranteed to be of type np.float64.
Parameters: df_impute (pandas.DataFrame) – DataFrame to impute
Return df_impute: imputed DataFrame
Rtype df_impute: pandas.DataFrame

tsfresh.utilities.dataframe_functions.
impute_dataframe_range
(df_impute, col_to_max, col_to_min, col_to_median)[source]¶ Columnwise replaces all NaNs, -inf and +inf from the DataFrame df_impute with average/extreme values from the provided dictionaries.
This is done as follows: each occurring inf or NaN in df_impute is replaced by
 -inf -> by value in col_to_min
 +inf -> by value in col_to_max
 NaN -> by value in col_to_median
If a column of df_impute is not found in one of the dictionaries, this method will raise a ValueError. Also, if one of the replacement values is not finite, a ValueError is raised.
This function modifies df_impute in place. Afterwards df_impute is guaranteed to not contain any non-finite values. Also, all columns will be guaranteed to be of type np.float64.
Parameters:  df_impute (pandas.DataFrame) – DataFrame to impute
 col_to_max (dict) – Dictionary mapping column names to max values
 col_to_min (dict) – Dictionary mapping column names to min values
 col_to_median (dict) – Dictionary mapping column names to median values
Return df_impute: imputed DataFrame
Rtype df_impute: pandas.DataFrame
Raises: ValueError – if a column of df_impute is missing in col_to_max, col_to_min or col_to_median, or if a replacement value is not finite
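The replacement rule and the two error conditions can be sketched without pandas as follows. The helper `impute_with_range` is hypothetical and works on a dict of lists; it mirrors the description above (in-place modification, float output, ValueError on missing columns or non-finite replacement values) under those simplifying assumptions.

```python
import math


def impute_with_range(columns, col_to_max, col_to_min, col_to_median):
    """Replace -inf/+inf/NaN in place with the values from the given dictionaries."""
    for mapping in (col_to_max, col_to_min, col_to_median):
        missing = set(columns) - set(mapping)
        if missing:
            raise ValueError(f"columns missing in a dictionary: {sorted(missing)}")
        if not all(math.isfinite(mapping[name]) for name in columns):
            raise ValueError("replacement values must be finite")

    for name, values in columns.items():
        for i, value in enumerate(values):
            if math.isnan(value):
                values[i] = float(col_to_median[name])
            elif value == math.inf:
                values[i] = float(col_to_max[name])
            elif value == -math.inf:
                values[i] = float(col_to_min[name])
            else:
                values[i] = float(value)
    return columns


data = {"a": [1.0, float("nan"), float("inf"), float("-inf")]}
impute_with_range(data, {"a": 9.0}, {"a": -9.0}, {"a": 2.0})
print(data)

# A column that is missing from the dictionaries triggers the error path
try:
    impute_with_range({"b": [1.0]}, {}, {}, {})
    missing_column_raised = False
except ValueError:
    missing_column_raised = True
```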

tsfresh.utilities.dataframe_functions.
impute_dataframe_zero
(df_impute)[source]¶ Replaces all NaNs, -infs and +infs from the DataFrame df_impute with 0s. The df_impute will be modified in place. All its columns will be converted into dtype np.float64.
Parameters: df_impute (pandas.DataFrame) – DataFrame to impute
Return df_impute: imputed DataFrame
Rtype df_impute: pandas.DataFrame

tsfresh.utilities.dataframe_functions.
make_forecasting_frame
(x, kind, max_timeshift, rolling_direction)[source]¶ Takes a singular time series x and constructs a DataFrame df and target vector y that can be used for a time series forecasting task.
The returned df will contain, for every time stamp in x, the last max_timeshift data points as a new time series, which can be used to fit a time series forecasting model.
See Rolling/Time series forecasting for a detailed description of the rolling process and how the feature matrix and target vector are derived.
The returned time series container df will contain the rolled time series as a flat data frame, the first format from Data Formats.
When x is a pandas.Series, the index will be used as id.
Parameters:  x (np.array or pd.Series) – the singular time series
 kind (str) – the kind of the time series
 rolling_direction (int) – The sign decides whether to roll backwards (if the sign is positive) or forwards in “time”
 max_timeshift (int) – If not None, shift only up to max_timeshift. If None, shift as often as possible.
Returns: time series container df, target vector y
Return type: (pd.DataFrame, pd.Series)

tsfresh.utilities.dataframe_functions.
restrict_input_to_index
(df_or_dict, column_id, index)[source]¶ Restrict df_or_dict to those ids contained in index.
Parameters:  df_or_dict (pandas.DataFrame or dict) – a pandas DataFrame or a dictionary.
 column_id (basestring) – it must be present in the pandas DataFrame or in all DataFrames in the dictionary. It is not allowed to have NaN values in this column.
 index (Iterable or pandas.Series) – Index containing the ids
Return df_or_dict_restricted: the restricted df_or_dict
Rtype df_or_dict_restricted: dict or pandas.DataFrame
Raise: TypeError
if df_or_dict is not of type dict or pandas.DataFrame

tsfresh.utilities.dataframe_functions.
roll_time_series
(df_or_dict, column_id, column_sort=None, column_kind=None, rolling_direction=1, max_timeshift=None, min_timeshift=0, chunksize=None, n_jobs=1, show_warnings=False, disable_progressbar=False, distributor=None)[source]¶ This method creates sub windows of the time series. It rolls the (sorted) data frames for each kind and each id separately in the “time” domain (which is represented by the sort order of the sort column given by column_sort).
For each rolling step, a new id is created by the scheme ({id}, {shift}), where id is the former id of the column and shift is the amount of “time” shifts. You can think of it as having a window of fixed length (the max_timeshift) moving one step at a time over your time series. Each cutout seen by the window is a new time series with a new identifier.
A few remarks:
 This method will create new IDs!
 The sign of rolling_direction defines the direction of time rolling: a positive value means we are shifting the cutout window forward in time. The name of each new sub time series is given by the last time point. This means that the time series named ([id=]4,[timeshift=]5) with a max_timeshift of 3 includes the data of the times 3, 4 and 5. A negative rolling direction means that you go in negative time direction over your data. The time series named ([id=]4,[timeshift=]5) with a max_timeshift of 3 would then include the data of the times 5, 6 and 7. The absolute value defines how much time to shift at each step.
 It is possible to shift time series of different lengths, but we assume that the time series are uniformly sampled.
 For more information, please see Rolling/Time series forecasting.
Parameters:  df_or_dict (pandas.DataFrame or dict) – a pandas DataFrame or a dictionary. The required shape/form of the object depends on the rest of the passed arguments.
 column_id (basestring) – it must be present in the pandas DataFrame or in all DataFrames in the dictionary. It is not allowed to have NaN values in this column.
 column_sort (basestring or None) – if not None, sort the rows by this column. It is not allowed to have NaN values in this column. If not given, will be filled by an increasing number, meaning that the order of the passed dataframes are used as “time” for the time series.
 column_kind (basestring or None) – It can only be used when passing a pandas DataFrame (the dictionary is already assumed to be grouped by the kind). It must be present in the DataFrame and no NaN values are allowed. If the kind column is not passed, it is assumed that each column in the pandas DataFrame (except the id or sort column) is a possible kind.
 rolling_direction (int) – The sign decides whether to shift our cutout window backwards or forwards in “time”. The absolute value decides how much to shift at each step.
 max_timeshift (int) – If not None, the cutout window is at most max_timeshift large. If None, it grows infinitely.
 min_timeshift (int) – Throw away all extracted forecast windows smaller than or equal to this. Must be larger than or equal to 0.
 n_jobs (int) – The number of processes to use for parallelization. If zero, no parallelization is used.
 chunksize (None or int) – How many shifts per job should be calculated.
 show_warnings (bool) – Show warnings during the feature extraction (needed for debugging of calculators).
 disable_progressbar (bool) – Do not show a progressbar while doing the calculation.
 distributor (class) – Advanced parameter: set this to a class name that you want to use as a distributor. See the utilities/distribution.py for more information. Leave to None, if you want TSFresh to choose the best distributor.
Returns: The rolled data frame or dictionary of data frames
Return type: the one from df_or_dict
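The naming scheme from the remarks above can be made concrete with a small sketch. It is a simplification under stated assumptions: the helper `roll_windows` is hypothetical, it works on a plain list of time points for a single id, always shifts by one step, and ignores min_timeshift. It only reproduces the two examples given in the text (id 4, time shift 5, max_timeshift 3).

```python
def roll_windows(series_id, times, max_timeshift, rolling_direction=1):
    """Return {(id, time): window} for a single, uniformly sampled series."""
    windows = {}
    for pos, t in enumerate(times):
        if rolling_direction > 0:
            # The window is named after its last time point
            start = max(0, pos - max_timeshift + 1)
            window = times[start:pos + 1]
        else:
            # The window is named after its first time point
            window = times[pos:pos + max_timeshift]
        windows[(series_id, t)] = window
    return windows


times = list(range(8))  # time points 0..7
forward = roll_windows(4, times, max_timeshift=3, rolling_direction=1)
backward = roll_windows(4, times, max_timeshift=3, rolling_direction=-1)

print(forward[(4, 5)])   # window ending at time 5
print(backward[(4, 5)])  # window starting at time 5
```

Windows near the start (or end, for a negative direction) are shorter than max_timeshift in this sketch because fewer data points are available there.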
tsfresh.utilities.distribution module¶
This module contains the Distributor classes. Such objects are used to distribute the calculation of features. Essentially, a Distributor organizes the application of feature calculators to data chunks.
Design of this module by Nils Braun

class
tsfresh.utilities.distribution.
ClusterDaskDistributor
(address)[source]¶ Bases:
tsfresh.utilities.distribution.IterableDistributorBaseClass
Distributor using a dask cluster, meaning that the calculation is spread over a cluster

calculate_best_chunk_size
(data_length)[source]¶ Uses the number of dask workers in the cluster (during execution time, meaning when you start the extraction) to find the optimal chunk_size.
Parameters: data_length (int) – A length which defines how many calculations there need to be.

distribute
(func, partitioned_chunks, kwargs)[source]¶ Calculates the features in a parallel fashion by distributing the map command to the dask workers on a cluster
Parameters:  func (callable) – the function to send to each worker.
 partitioned_chunks (iterable) – The list of data chunks - each element is again a list of chunks - and should be processed by one worker.
 kwargs (dict of string to parameter) – parameters for the map function
Returns: The result of the calculation as a list - each item should be the result of the application of func to a single element.


class
tsfresh.utilities.distribution.
DistributorBaseClass
[source]¶ Bases:
object
The distributor abstract base class.
The main purpose of the instances of the DistributorBaseClass subclasses is to evaluate a function (called map_function) on a list of data items (called data).
Dependent on the implementation of the distribute function, this is done in parallel or using a cluster of nodes.

map_reduce
(map_function, data, function_kwargs=None, chunk_size=None, data_length=None)[source]¶ This method contains the core functionality of the DistributorBaseClass class.
It maps the map_function to each element of the data and reduces the results to return a flattened list.
It needs to be implemented for each of the subclasses.
Parameters:  map_function (callable) – a function to apply to each data item.
 data (iterable) – the data to use in the calculation
 function_kwargs (dict of string to parameter) – parameters for the map function
 chunk_size (int) – If given, chunk the data according to this size. If not given, use an empirical value.
 data_length (int) – If the data is a generator, you have to set the length here. If it is none, the length is deduced from the len of the data.
Returns: the calculated results
Return type:


class
tsfresh.utilities.distribution.
IterableDistributorBaseClass
[source]¶ Bases:
tsfresh.utilities.distribution.DistributorBaseClass
Distributor Base Class that can handle all iterable items and calculate a map_function on each item separately.
This is done on chunks of the data, meaning that the DistributorBaseClass classes will split the data into chunks, distribute them and apply the map_function to the items separately.
Dependent on the implementation of the distribute function, this is done in parallel or using a cluster of nodes.

calculate_best_chunk_size
(data_length)[source]¶ Calculates the best chunk size for a list of length data_length. The currently implemented formula is more or less an empirical result for the multiprocessing case on one machine.
Parameters: data_length (int) – A length which defines how many calculations there need to be.
Returns: the calculated chunk size
Return type: int
TODO: Investigate which is the best chunk size for different settings.

close
()[source]¶ Abstract base function to clean the DistributorBaseClass after use, e.g. close the connection to a DaskScheduler

distribute
(func, partitioned_chunks, kwargs)[source]¶ This abstract base function distributes the work among workers, which can be threads or nodes in a cluster. Must be implemented in the derived classes.
Parameters:  func (callable) – the function to send to each worker.
 partitioned_chunks (iterable) – The list of data chunks - each element is again a list of chunks - and should be processed by one worker.
 kwargs (dict of string to parameter) – parameters for the map function
Returns: The result of the calculation as a list - each item should be the result of the application of func to a single element.

map_reduce
(map_function, data, function_kwargs=None, chunk_size=None, data_length=None)[source]¶ This method contains the core functionality of the DistributorBaseClass class.
It maps the map_function to each element of the data and reduces the results to return a flattened list.
How the jobs are calculated is determined by the class's
tsfresh.utilities.distribution.DistributorBaseClass.distribute()
method, which can distribute the jobs in multiple threads, across multiple processing units etc.
To avoid transporting each element of the data individually, the data is split into chunks according to the chunk size (or an empirical guess if none is given). This way, each worker processes adequately sized parts of the data rather than tiny ones.
Parameters:  map_function (callable) – a function to apply to each data item.
 data (iterable) – the data to use in the calculation
 function_kwargs (dict of string to parameter) – parameters for the map function
 chunk_size (int) – If given, chunk the data according to this size. If not given, use an empirical value.
 data_length (int) – If the data is a generator, you have to set the length here. If it is none, the length is deduced from the len of the data.
Returns: the calculated results
Return type:
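The chunk, map and flatten steps described for map_reduce can be sketched sequentially as below. This is an illustration under assumptions, not the library code: the names `apply_to_chunk`, `map_reduce_sketch` and `scale` are hypothetical, the data is materialized as a list, and the distribution step is replaced by Python's built-in map.

```python
def apply_to_chunk(func, chunk, kwargs):
    """Apply func to every item of one chunk (the work one worker would do)."""
    return [func(item, **kwargs) for item in chunk]


def map_reduce_sketch(map_function, data, function_kwargs=None, chunk_size=2):
    """Chunk the data, map over the chunks and flatten the per-chunk results."""
    kwargs = function_kwargs or {}
    items = list(data)
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    # A real distributor would hand the chunks to workers here
    chunk_results = map(lambda chunk: apply_to_chunk(map_function, chunk, kwargs), chunks)
    # Reduce step: flatten the list of per-chunk result lists
    return [result for chunk in chunk_results for result in chunk]


def scale(x, factor=1):
    return x * factor


results = map_reduce_sketch(scale, range(5), {"factor": 10}, chunk_size=2)
print(results)
```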

static
partition
(data, chunk_size)[source]¶ This generator partitions an iterable into slices of length chunk_size. If the chunk size is not a divisor of the data length, the last slice will be shorter.
The important part here is that the iterable is only traversed once and the chunks are produced one at a time. This is good for both memory and speed.
Parameters:  data (Iterable) – The data to partition.
 chunk_size (int) – The chunk size. The last chunk might be smaller.
Returns: A generator producing the chunks of data.
Return type: Generator[Iterable]
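One common way to write such a single-pass partitioning generator uses itertools.islice. The following is a minimal sketch of the technique, not necessarily how tsfresh implements it; the function name `partition_sketch` is hypothetical.

```python
from itertools import islice


def partition_sketch(data, chunk_size):
    """Yield lists of up to chunk_size items, traversing data only once."""
    iterator = iter(data)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# 7 items with chunk size 3: the last chunk is shorter
chunks = list(partition_sketch(range(7), 3))
print(chunks)

# Works on one-shot generators too, because nothing is traversed twice
generator_chunks = list(partition_sketch((i * i for i in range(4)), 2))
print(generator_chunks)
```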


class
tsfresh.utilities.distribution.
LocalDaskDistributor
(n_workers)[source]¶ Bases:
tsfresh.utilities.distribution.IterableDistributorBaseClass
Distributor using a local dask cluster and inproc communication.

distribute
(func, partitioned_chunks, kwargs)[source]¶ Calculates the features in a parallel fashion by distributing the map command to the dask workers on a local machine
Parameters:  func (callable) – the function to send to each worker.
 partitioned_chunks (iterable) – The list of data chunks - each element is again a list of chunks - and should be processed by one worker.
 kwargs (dict of string to parameter) – parameters for the map function
Returns: The result of the calculation as a list - each item should be the result of the application of func to a single element.


class
tsfresh.utilities.distribution.
MapDistributor
(disable_progressbar=False, progressbar_title='Feature Extraction')[source]¶ Bases:
tsfresh.utilities.distribution.IterableDistributorBaseClass
Distributor using the Python built-in map, which calculates each job sequentially, one after the other.

calculate_best_chunk_size
(data_length)[source]¶ For the map command, which calculates the features sequentially, a chunk_size of 1 will be used.
Parameters: data_length (int) – A length which defines how many calculations there need to be.

distribute
(func, partitioned_chunks, kwargs)[source]¶ Calculates the features in a sequential fashion using Python's map command
Parameters:  func (callable) – the function to send to each worker.
 partitioned_chunks (iterable) – The list of data chunks - each element is again a list of chunks - and should be processed by one worker.
 kwargs (dict of string to parameter) – parameters for the map function
Returns: The result of the calculation as a list - each item should be the result of the application of func to a single element.


class
tsfresh.utilities.distribution.
MultiprocessingDistributor
(n_workers, disable_progressbar=False, progressbar_title='Feature Extraction', show_warnings=True)[source]¶ Bases:
tsfresh.utilities.distribution.IterableDistributorBaseClass
Distributor using a multiprocessing Pool to calculate the jobs in parallel on the local machine.

distribute
(func, partitioned_chunks, kwargs)[source]¶ Calculates the features in a parallel fashion by distributing the map command to a pool of worker processes
Parameters:  func (callable) – the function to send to each worker.
 partitioned_chunks (iterable) – The list of data chunks - each element is again a list of chunks - and should be processed by one worker.
 kwargs (dict of string to parameter) – parameters for the map function
Returns: The result of the calculation as a list - each item should be the result of the application of func to a single element.


tsfresh.utilities.distribution.
initialize_warnings_in_workers
(show_warnings)[source]¶ Small helper function to initialize warnings module in multiprocessing workers.
On Windows, Python spawns fresh processes which do not inherit the warnings state, so warnings must be enabled/disabled before running computations.
Parameters: show_warnings (bool) – whether to show warnings or not.
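A helper of this kind can be sketched with the standard warnings module. The function names `init_warnings` and `count_warnings` are hypothetical, and the choice of the "default" and "ignore" filter actions is an assumption made for illustration; the sketch only demonstrates that the flag decides whether a warning becomes visible.

```python
import warnings


def init_warnings(show_warnings):
    """Enable or silence warnings in the current process."""
    warnings.simplefilter("default" if show_warnings else "ignore")


def count_warnings(show_warnings):
    """Emit one warning under the chosen setting and count how many are recorded."""
    with warnings.catch_warnings(record=True) as caught:
        init_warnings(show_warnings)
        warnings.warn("example warning from a feature calculator")
        return len(caught)


shown = count_warnings(True)
hidden = count_warnings(False)
print(shown, hidden)
```

In a multiprocessing setting, such a function could be passed as the `initializer` argument of `multiprocessing.Pool` (with `initargs=(show_warnings,)`), so that every freshly spawned worker applies the setting before it starts computing.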
tsfresh.utilities.profiling module¶
Contains methods to start and stop the profiler that checks the runtime of the different feature calculators

tsfresh.utilities.profiling.
end_profiling
(profiler, filename, sorting=None)[source]¶ Helper function to stop the profiling process and write out the profiled data into the given filename. Before this, sort the stats by the passed sorting.
Parameters:  profiler (cProfile.Profile) – An already started profiler (probably by start_profiling).
 filename (basestring) – The name of the output file to save the profile.
 sorting (basestring) – The sorting of the statistics passed to the sort_stats function.
Returns: None
Start and stop the profiler with:
>>> profiler = start_profiling()
>>> # Do something you want to profile
>>> end_profiling(profiler, "out.txt", "cumulative")

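For orientation, a start/stop pair of this shape can be built from the standard cProfile and pstats modules. The sketch below is an assumption about how such helpers might look, not the tsfresh source; the names `start_profiling_sketch` and `end_profiling_sketch` are hypothetical, and the output goes to a temporary directory.

```python
import cProfile
import io
import os
import pstats
import tempfile


def start_profiling_sketch():
    """Create a profiler, start it and return it so it can be stopped later."""
    profiler = cProfile.Profile()
    profiler.enable()
    return profiler


def end_profiling_sketch(profiler, filename, sorting=None):
    """Stop the profiler, optionally sort the statistics and write them to filename."""
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    if sorting:
        stats.sort_stats(sorting)
    stats.print_stats()
    with open(filename, "w") as handle:
        handle.write(stream.getvalue())


profiler = start_profiling_sketch()
total = sum(i * i for i in range(1000))  # the work being profiled
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")
end_profiling_sketch(profiler, out_path, "cumulative")
```

The argument order (profiler, then filename, then sorting) follows the end_profiling signature shown above.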
tsfresh.utilities.profiling.
start_profiling
()[source]¶ Helper function to start the profiling process and return the profiler (to close it later).
Returns: a started profiler.
Return type: cProfile.Profile
Start and stop the profiler with:
>>> profiler = start_profiling()
>>> # Do something you want to profile
>>> end_profiling(profiler, "out.txt", "cumulative")
tsfresh.utilities.string_manipulation module¶

tsfresh.utilities.string_manipulation.
convert_to_output_format
(param)[source]¶ Helper function to convert parameters to a valid string that can be used in a column name. It does the opposite of what is done in the from_columns function.
The parameters are sorted by their name and written out in the form
<param name>_<param value>__<param name>_<param value>__ …
If a <param_value> is a string, this method will wrap it in quotation marks ("), so "<param_value>".
Parameters: param (dict) – The dictionary of parameters to write out
Returns: The string of parsed parameters
Return type: str
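The encoding scheme described above can be sketched in a few lines. The function name `to_output_format` and the example parameter names are hypothetical; the sketch follows the stated rules (sort by name, join name and value with “_”, join pairs with “__”, quote string values) and is not the library code.

```python
def to_output_format(param):
    """Encode a parameter dict as name_value pairs joined by a double underscore."""

    def render(value):
        # String values are wrapped in double quotes, everything else uses str()
        if isinstance(value, str):
            return '"' + value + '"'
        return str(value)

    return "__".join(f"{name}_{render(param[name])}" for name in sorted(param))


encoded = to_output_format({"r": 0.5, "attr": "mean", "lags": (1, 2, 3)})
print(encoded)
```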

tsfresh.utilities.string_manipulation.
get_config_from_string
(parts)[source]¶ Helper function to extract the configuration of a certain function from the column name. The column name parts (split by “__”) should be passed to this function. It will skip the kind name and the function name and only use the parameter parts. These parts will be split up on “_” into the parameter name and the parameter value. This value is transformed into a Python object (for example, “(1, 2, 3)” is transformed into a tuple consisting of the ints 1, 2 and 3).
Returns None if no parameters are in the column name.
Parameters: parts (list) – The column name split up on “__”
Returns: a dictionary with all parameters, which are encoded in the column name.
Return type: dict
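The parsing direction can be sketched as follows. Two details are assumptions made for this illustration: each part is split at its last “_” (so that parameter names may themselves contain underscores), and the value is converted with ast.literal_eval. The function name `config_from_parts` and the example column name are hypothetical.

```python
import ast


def config_from_parts(parts):
    """Turn the parameter parts of a column name into a dict, or None if there are none."""
    relevant = parts[2:]  # skip the kind name and the function name
    if not relevant:
        return None
    config = {}
    for part in relevant:
        name, value = part.rsplit("_", 1)
        # literal_eval safely converts text such as "(1, 2, 3)" or "0.5" into Python objects
        config[name] = ast.literal_eval(value)
    return config


column = 'temperature__some_feature__attr_"mean"__lags_(1, 2, 3)__r_0.5'
config = config_from_parts(column.split("__"))
print(config)

no_params = config_from_parts("temperature__some_feature".split("__"))
```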