Version: v1.0.0


module ml.dataframes

The dataframes module provides common operations for working with pandas DataFrames.

function shuffle

shuffle(df: DataFrame) → DataFrame

Shuffles the records of the DataFrame and returns it.


  • `df` (pd.DataFrame): The DataFrame that should have its records shuffled


  • `pd.DataFrame`: The shuffled DataFrame
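
As a rough illustration of the behavior described above, the same effect can be reproduced with plain pandas. This is only a sketch and not necessarily how the module implements it; `shuffle_sketch` and its `seed` parameter are hypothetical names introduced here for reproducibility.

```python
import pandas as pd


def shuffle_sketch(df: pd.DataFrame, seed=None) -> pd.DataFrame:
    """Return the rows of df in random order (hypothetical helper, not the module's code)."""
    # frac=1 samples every row exactly once, which amounts to a full permutation
    return df.sample(frac=1, random_state=seed)


df = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["w", "x", "y", "z"]})
shuffled = shuffle_sketch(df, seed=0)
```

Each row keeps its original index label in this sketch, so `shuffled.reset_index(drop=True)` may be useful if a fresh 0..n-1 index is wanted.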

function one_hot_encode

one_hot_encode(df: DataFrame, column_name: str, drop_column: bool = True, prefix: str = None) → DataFrame

Takes a categorical column and pivots the DataFrame to add a column (with value 0 or 1) for every category.


  • `df` (pd.DataFrame): The DataFrame that contains the column to be encoded
  • `column_name` (str): The name of the column that contains the categorical values
  • `drop_column` (bool): If True, the original column is removed from the DataFrame. Defaults to True
  • `prefix` (str): The prefix of the new columns. Defaults to the original column name


  • `pd.DataFrame`: The DataFrame with the one hot encoded features
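
The encoding described above can be approximated with `pd.get_dummies`. The sketch below is an assumption about the behavior, not the module's actual code: `one_hot_encode_sketch` is a hypothetical name, and the `prefix_value` naming of the new columns (underscore separator) is the pandas default, which the module may or may not share.

```python
import pandas as pd


def one_hot_encode_sketch(df, column_name, drop_column=True, prefix=None):
    """Hypothetical re-creation of the documented behavior with pd.get_dummies."""
    # Fall back to the original column name as the prefix, as the docs describe
    prefix = column_name if prefix is None else prefix
    # One 0/1 column per category; astype(int) normalizes bool/uint8 output to ints
    dummies = pd.get_dummies(df[column_name], prefix=prefix).astype(int)
    result = pd.concat([df, dummies], axis=1)
    if drop_column:
        result = result.drop(columns=[column_name])
    return result


df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})
encoded = one_hot_encode_sketch(df, "color")
```

With this input, `encoded` holds the untouched `size` column plus one indicator column per distinct value of `color`.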

function keep_numeric_features

keep_numeric_features(df: DataFrame) → DataFrame

Removes all non-numeric columns (features) from the DataFrame.


  • `df` (pd.DataFrame): The DataFrame that should have its non-numeric columns removed


  • `pd.DataFrame`: The DataFrame with only the numeric features
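
A comparable filter can be written with `DataFrame.select_dtypes`. This is a minimal sketch under the assumption that "numeric" means integer and float dtypes; `keep_numeric_features_sketch` is a hypothetical name and the module's own definition of numeric may differ.

```python
import pandas as pd


def keep_numeric_features_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only integer and float columns (hypothetical helper)."""
    # include="number" selects every dtype derived from numpy's generic number type
    return df.select_dtypes(include="number")


df = pd.DataFrame({"age": [31, 45], "name": ["Ann", "Bob"], "score": [0.5, 0.9]})
numeric = keep_numeric_features_sketch(df)
```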

function plot_features

plot_features(df: DataFrame, column_names: np.array = None, grid_shape=None, fig_size=None)

Plots the distribution of the relevant columns of a DataFrame.


  • `df` (pd.DataFrame): The DataFrame whose columns should be plotted
  • `column_names` (np.array): The columns that should be plotted. If None, all numeric columns will be taken
  • `grid_shape` (int, int): The shape of the plotting grid (rows, cols). If None, the grid will have a maximum of 5 columns
  • `fig_size` (int, int): The size of the full plotting grid. If None, auto size will be applied


  • `figure, axes (tuple)`: The figure and the axes of the plot, returned for further tuning where needed
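
To make the grid and return-value description more concrete, here is a sketch of a histogram grid built directly on matplotlib. It is not the module's plotting code: `plot_features_sketch` is a hypothetical name, histograms are an assumed way of showing a distribution, and the default grid (at most 5 columns, as many rows as needed) is an interpretation of the parameter description above.

```python
import math

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_features_sketch(df, column_names=None, grid_shape=None, fig_size=None):
    """Hypothetical histogram grid; returns (figure, axes) for further tuning."""
    if column_names is None:
        column_names = df.select_dtypes(include="number").columns
    n = len(column_names)
    if grid_shape is None:
        # Assumed default: at most 5 columns, as many rows as needed
        cols = max(1, min(n, 5))
        grid_shape = (max(1, math.ceil(n / cols)), cols)
    # squeeze=False keeps axes two-dimensional even for a single row or column
    fig, axes = plt.subplots(*grid_shape, figsize=fig_size, squeeze=False)
    for ax, name in zip(axes.flat, column_names):
        ax.hist(df[name].dropna())
        ax.set_title(name)
    # Hide grid cells that have no column assigned to them
    for ax in axes.flat[n:]:
        ax.set_visible(False)
    return fig, axes


df = pd.DataFrame({"a": range(10), "b": [x * 0.5 for x in range(10)], "c": list("abcdefghij")})
fig, axes = plot_features_sketch(df)
axes[0, 0].set_xlabel("value of a")  # example of tuning the returned axes
```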

function to_timeseries

to_timeseries(df: DataFrame, time_column: str) → DataFrame

Deprecated: use the timeseries.set_timeseries function instead.

function distribute_class

distribute_class(df: DataFrame, class_column: str, class_size: int = None, shuffle_result: bool = True)

Returns a DataFrame with an equal class distribution. For every class, the same number of samples is taken. The class size is the minimum of the passed class_size parameter and the size of the smallest class in the DataFrame.


  • `df` (pd.DataFrame): The DataFrame that contains all records
  • `class_column` (str): The name of the column that contains the class feature
  • `class_size` (int): The number of samples to take per class. Defaults to the size of the smallest class
  • `shuffle_result` (bool): Indicates whether the DataFrame should be shuffled before returning. Defaults to True


  • `pd.DataFrame`: The DataFrame that contains the records with the equal class distribution
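
The class-size rule described above (the minimum of `class_size` and the smallest class) can be sketched as a simple downsampling routine. This is an illustration under assumptions, not the module's implementation: `distribute_class_sketch` is a hypothetical name, random sampling per class is assumed, and the `seed` parameter is added here only to make the example reproducible.

```python
import pandas as pd


def distribute_class_sketch(df, class_column, class_size=None, shuffle_result=True, seed=None):
    """Hypothetical downsampling so that every class has the same number of rows."""
    smallest = int(df[class_column].value_counts().min())
    # Effective size: the requested size, capped by the smallest class
    size = smallest if class_size is None else min(class_size, smallest)
    parts = [
        group.sample(n=size, random_state=seed)
        for _, group in df.groupby(class_column)
    ]
    result = pd.concat(parts)
    if shuffle_result:
        # Mix the classes so they are not grouped together in the output
        result = result.sample(frac=1, random_state=seed)
    return result


df = pd.DataFrame({"label": ["a"] * 5 + ["b"] * 2 + ["c"] * 3, "x": range(10)})
balanced = distribute_class_sketch(df, "label", seed=0)
```

Here the smallest class (`b`) has two rows, so every class is reduced to two rows; passing a `class_size` larger than that would still yield two rows per class in this sketch.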

This file was automatically generated via lazydocs.