ml.dataframes
module ml.dataframes
The dataframes module provides common operations for working with pandas DataFrames.
shuffle
function shuffle(df: DataFrame) → DataFrame
Shuffles the records of the DataFrame and returns the result.
Args:
- `df` (pd.DataFrame): The DataFrame that should have its records shuffled
Returns:
- `pd.DataFrame`: The DataFrame that is shuffled
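The sketch below illustrates the behaviour described above using plain pandas. It is not the module's implementation: `shuffle_sketch` and its `seed` parameter are hypothetical names introduced here, and whether the real function keeps or resets the index is not stated in this reference.

```python
import pandas as pd


def shuffle_sketch(df: pd.DataFrame, seed=None) -> pd.DataFrame:
    """Hypothetical stand-in for shuffle: return the rows in random order."""
    # frac=1 samples every row exactly once, which amounts to a shuffle.
    return df.sample(frac=1, random_state=seed)


df = pd.DataFrame({"id": [1, 2, 3, 4], "label": ["a", "b", "c", "d"]})
shuffled = shuffle_sketch(df, seed=0)

# Same records, possibly in a different order.
print(sorted(shuffled["id"].tolist()))
```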
one_hot_encode
function one_hot_encode(df: DataFrame, column_name: str, drop_column: bool = True, prefix: str = None) → DataFrame
Takes a categorical column and pivots the DataFrame, adding a column (with value 0 or 1) for every category
Args:
- `df` (pd.DataFrame): The DataFrame that contains the column to be encoded
- `column_name` (str): The name of the column that contains the categorical values
- `drop_column` (bool): Whether to remove the original column from the DataFrame. Defaults to True
- `prefix` (str): The prefix of the new columns. Defaults to the original column name
Returns:
- `pd.DataFrame`: The DataFrame with the one hot encoded features
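As a rough illustration of what one-hot encoding with these parameters produces, the sketch below builds a comparable result with `pd.get_dummies`. The function name `one_hot_encode_sketch` is hypothetical, and the exact naming of the generated columns in the module itself is an assumption (here: prefix, underscore, category).

```python
import pandas as pd


def one_hot_encode_sketch(df, column_name, drop_column=True, prefix=None):
    """Hypothetical stand-in: add one 0/1 column per category of column_name."""
    # Fall back to the original column name when no prefix is given.
    prefix = column_name if prefix is None else prefix
    dummies = pd.get_dummies(df[column_name], prefix=prefix, dtype=int)
    result = pd.concat([df, dummies], axis=1)
    if drop_column:
        result = result.drop(columns=[column_name])
    return result


df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})
encoded = one_hot_encode_sketch(df, "color")
print(encoded.columns.tolist())
```

Passing `drop_column=False` in this sketch keeps the original `color` column alongside the new indicator columns.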
keep_numeric_features
function keep_numeric_features(df: DataFrame) → DataFrame
Takes the DataFrame and removes all non-numeric columns or features
Args:
- `df` (pd.DataFrame): The DataFrame that should have its non-numerics removed
Returns:
- `pd.DataFrame`: The DataFrame with only the numeric features
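A minimal sketch of the same idea in plain pandas, assuming "numeric" means integer and float dtypes; how the module treats boolean or datetime columns is not specified here. The name `keep_numeric_sketch` is hypothetical.

```python
import pandas as pd


def keep_numeric_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: keep only the columns with a numeric dtype."""
    # "number" matches integer and float dtypes, not object or bool columns.
    return df.select_dtypes(include="number")


df = pd.DataFrame(
    {"age": [31, 45], "name": ["Ann", "Bob"], "score": [0.5, 0.9]}
)
numeric = keep_numeric_sketch(df)
print(numeric.columns.tolist())
```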
plot_features
function plot_features(df: DataFrame, column_names: np.array = None, grid_shape=None, fig_size=None)
Plots the distribution of the relevant columns of a DataFrame
Args:
- `df` (pd.DataFrame): The DataFrame that contains the columns to be plotted
- `column_names` (np.array): The columns that should be plotted. If None, all numeric columns will be taken
- `grid_shape` (int, int): The shape of the plotting grid (rows, cols). If None, the grid will have a maximum of 5 columns
- `fig_size` (int, int): The size of the full plotting grid. If None, auto size will be applied
Returns:
- `figure, axes (tuple)`: The figure of the plot and the axes of the plot will be returned for further tuning where needed
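The following sketch shows one way the defaults described above could fit together. It assumes the distributions are drawn as histograms with matplotlib, which this reference does not confirm, and `plot_features_sketch` is a hypothetical name rather than the module's code.

```python
import math

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd


def plot_features_sketch(df, column_names=None, grid_shape=None, fig_size=None):
    """Hypothetical stand-in: one histogram per column, laid out on a grid."""
    if column_names is None:
        column_names = df.select_dtypes(include="number").columns
    column_names = list(column_names)
    if grid_shape is None:
        # At most 5 columns per row; add rows as needed.
        n_cols = max(1, min(5, len(column_names)))
        n_rows = max(1, math.ceil(len(column_names) / n_cols))
        grid_shape = (n_rows, n_cols)
    fig, axes = plt.subplots(
        grid_shape[0], grid_shape[1], figsize=fig_size, squeeze=False
    )
    flat_axes = list(axes.flat)
    for ax, name in zip(flat_axes, column_names):
        ax.hist(df[name].dropna())
        ax.set_title(name)
    # Hide grid cells that have no column to show.
    for ax in flat_axes[len(column_names):]:
        ax.set_visible(False)
    return fig, axes


df = pd.DataFrame({f"f{i}": [1.0, 2.0, 2.0, 3.0] for i in range(6)})
fig, axes = plot_features_sketch(df)
print(axes.shape)
```

The returned figure and axes can then be adjusted further, for example with `fig.tight_layout()` or by setting labels on individual axes.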
to_timeseries
function to_timeseries(df: DataFrame, time_column: str) → DataFrame
Deprecated: use the timeseries.set_timeseries function instead.
distribute_class
function distribute_class(df: DataFrame, class_column: str, class_size: int = None, shuffle_result: bool = True)
Returns a DataFrame with an equal class distribution. For every class, the same number of samples is taken. The class size is the minimum of the passed class_size parameter and the size of the smallest class in the DataFrame.
Args:
- `df` (pd.DataFrame): The DataFrame that contains all records
- `class_column` (str): The name of the column that contains the class feature
- `class_size` (int): The number of samples per class. Defaults to the size of the smallest class
- `shuffle_result` (bool): Whether the DataFrame should be shuffled before returning. Defaults to True
Returns:
- `pd.DataFrame`: The DataFrame that contains the records with the equal class distribution
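The class-size rule described above (the minimum of `class_size` and the smallest class) can be sketched with plain pandas as follows. This is an illustration under assumptions, not the module's code: `distribute_class_sketch` and its `seed` parameter are hypothetical, and random sampling within each class is assumed.

```python
import pandas as pd


def distribute_class_sketch(
    df, class_column, class_size=None, shuffle_result=True, seed=None
):
    """Hypothetical stand-in: sample the same number of rows from every class."""
    smallest = int(df[class_column].value_counts().min())
    # Never ask for more rows than the smallest class can provide.
    size = smallest if class_size is None else min(class_size, smallest)
    parts = [
        group.sample(n=size, random_state=seed)
        for _, group in df.groupby(class_column)
    ]
    balanced = pd.concat(parts)
    if shuffle_result:
        balanced = balanced.sample(frac=1, random_state=seed)
    return balanced


df = pd.DataFrame(
    {"label": ["a"] * 5 + ["b"] * 2 + ["c"] * 3, "value": list(range(10))}
)
# class_size=4 exceeds the smallest class (2 rows), so 2 rows per class are kept.
balanced = distribute_class_sketch(df, "label", class_size=4, seed=0)
print(balanced["label"].value_counts().to_dict())
```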
This file was automatically generated via lazydocs.