Normalize columns of pandas data frame
Sophia Terry
I have a dataframe in pandas where each column has a different value range. For example:
df:
A B C
1000 10 0.5
765 5 0.35
800 7 0.09
Any idea how I can normalize the columns of this dataframe where each value is between 0 and 1?
My desired output is:
A B C
1 1 1
0.765 0.5 0.7
0.8 0.7 0.18 (which is 0.09/0.5)
22 Answers
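One easy way using pandas (here, standardization to zero mean and unit standard deviation, sometimes loosely called mean normalization):
normalized_df = (df - df.mean()) / df.std()
To use min-max normalization:
normalized_df = (df - df.min()) / (df.max() - df.min())
Edit: To address some concerns, note that pandas applies these operations column-wise in the code above. Only the min-max version guarantees values between 0 and 1; standardized values can be negative.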
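As a rough illustration on the data from the question, the sketch below compares the two formulas above with plain division by the column maximum. The variable names are made up for this example. Note that the desired output in the question matches division by the maximum rather than min-max scaling.

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    "A": [1000, 765, 800],
    "B": [10, 5, 7],
    "C": [0.5, 0.35, 0.09],
})

# Divide each column by its own maximum (this is what the desired output shows).
scaled_by_max = df / df.max()

# Min-max scaling: each column's minimum maps to 0 and its maximum to 1.
min_max = (df - df.min()) / (df.max() - df.min())

# Standardization (z-score): zero mean and unit sample standard deviation per column.
z_scores = (df - df.mean()) / df.std()

print(scaled_by_max)
print(min_max)
print(z_scores)
```

All three operate column by column because `df.max()`, `df.min()`, `df.mean()` and `df.std()` each return one value per column, and pandas aligns those values with the column labels.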
You can use the package sklearn and its associated preprocessing utilities to normalize the data.
import pandas as pd
from sklearn import preprocessing
x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)
For more information look at the scikit-learn documentation on preprocessing data: scaling features to a range.
Based on this post:
You can do the following:
def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result
You don't need to worry about whether your values are negative or positive. The values will be spread between 0 and 1.
Detailed Example of Normalization Methods
- Pandas normalization (unbiased)
- Sklearn normalization (biased)
- Does biased-vs-unbiased affect Machine Learning?
- Min-max scaling
References: Wikipedia: Unbiased Estimation of Standard Deviation
Example Data
import pandas as pd
df = pd.DataFrame({ 'A':[1,2,3], 'B':[100,300,500], 'C':list('abc') })
print(df)
   A    B  C
0  1  100  a
1  2  300  b
2  3  500  c
Normalization using pandas (gives unbiased estimates)
When normalizing we simply subtract the mean and divide by standard deviation.
df.iloc[:,0:-1] = df.iloc[:,0:-1].apply(lambda x: (x-x.mean())/ x.std(), axis=0)
print(df)
     A    B  C
0 -1.0 -1.0  a
1  0.0  0.0  b
2  1.0  1.0  c
Normalization using sklearn (gives biased estimates, different from pandas)
If you do the same thing with sklearn you will get DIFFERENT output!
import pandas as pd
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df = pd.DataFrame({ 'A':[1,2,3], 'B':[100,300,500], 'C':list('abc') })
df.iloc[:,0:-1] = scaler.fit_transform(df.iloc[:,0:-1].to_numpy())
print(df)
          A         B  C
0 -1.224745 -1.224745  a
1  0.000000  0.000000  b
2  1.224745  1.224745  c
Do the biased estimates of sklearn make machine learning less powerful?
NO.
The official documentation of sklearn.preprocessing.scale states that using a biased estimator is UNLIKELY to affect the performance of machine learning algorithms, so we can safely use it.
From the official documentation:
We use a biased estimator for the standard deviation, equivalent to
numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.
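The difference between the two outputs comes down to the `ddof` argument alone. The following sketch uses only pandas (no sklearn) and passes `ddof=0` to `DataFrame.std` to reproduce the 1.224745 values shown above. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [100, 300, 500]})

# pandas default: sample standard deviation (ddof=1).
z_sample = (df - df.mean()) / df.std()

# Population standard deviation (ddof=0), the convention of numpy.std.
z_population = (df - df.mean()) / df.std(ddof=0)

print(z_sample["A"].tolist())                # [-1.0, 0.0, 1.0]
print(z_population["A"].round(6).tolist())   # [-1.224745, 0.0, 1.224745]
```

The two results differ only by a constant factor per column, sqrt(n / (n - 1)), which shrinks toward 1 as the number of rows grows.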
What about MinMax Scaling?
There is no standard deviation calculation in MinMax scaling, so the result is the same in both pandas and scikit-learn.
import pandas as pd
df = pd.DataFrame({ 'A':[1,2,3], 'B':[100,300,500], })
(df - df.min()) / (df.max() - df.min())
     A    B
0  0.0  0.0
1  0.5  0.5
2  1.0  1.0
# Using sklearn
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
arr_scaled = scaler.fit_transform(df)
print(arr_scaled)
[[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]
df_scaled = pd.DataFrame(arr_scaled, columns=df.columns,index=df.index)
print(df_scaled)
     A    B
0  0.0  0.0
1  0.5  0.5
2  1.0  1.0
Your problem is actually a simple transform acting on the columns:
def f(s): return s/s.max()
frame.apply(f, axis=0)
Or even more terse:
frame.apply(lambda x: x/x.max(), axis=0)
If you like using the sklearn package, you can keep the column and index names by using pandas loc like so:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_values = scaler.fit_transform(df)
df.loc[:,:] = scaled_values
Take care with this answer, as it ONLY works for data in the range [0, n]. It does not work for an arbitrary range of data.
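The caveat concerns dividing a column by its maximum, as the answer that follows does. A minimal sketch with a made-up, all-negative column shows why: the maximum is itself negative, so the quotients fall outside [0, 1], whereas min-max scaling still lands inside it.

```python
import pandas as pd

# Hypothetical column whose values are all negative.
s = pd.Series([-10, -30, -50], name="b")

# Dividing by the maximum (-10 here) does not bound the result.
by_max = s / s.max()

# Min-max scaling maps the smallest value to 0 and the largest to 1.
min_max = (s - s.min()) / (s.max() - s.min())

print(by_max.tolist())   # [1.0, 3.0, 5.0]
print(min_max.tolist())  # [1.0, 0.5, 0.0]
```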
Simple is Beautiful:
df["A"] = df["A"] / df["A"].max()
df["B"] = df["B"] / df["B"].max()
df["C"] = df["C"] / df["C"].max()
You can create a list of columns that you want to normalize:
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
column_names_to_normalize = ['A', 'E', 'G', 'sadasdsd', 'lol']
x = df[column_names_to_normalize].values
x_scaled = min_max_scaler.fit_transform(x)
df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index=df.index)
df[column_names_to_normalize] = df_temp
Your pandas DataFrame is now normalized only in the columns you want.
However, if you want the opposite, that is, to list the columns that you DON'T want to normalize, you can simply create a list of all columns and remove the undesired ones:
column_names_to_not_normalize = ['B', 'J', 'K']
column_names_to_normalize = [x for x in list(df) if x not in column_names_to_not_normalize]
I think that a better way to do that in pandas is just
df = df/df.max().astype(np.float64)
Edit: If negative numbers are present in your data frame, you should instead use
df = df/df.abs().max().astype(np.float64)  # divide each column by its largest absolute value
The solution given by Sandman and Praveen is very good. The only problem is that if you have categorical variables in other columns of your data frame, this method will need some adjustments.
My solution to this type of issue is the following:
from sklearn import preprocessing
# Scale only the numerical columns
x = pd.concat([df.Numerical1, df.Numerical2, df.Numerical3], axis=1)
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
x_new = pd.DataFrame(x_scaled, columns=x.columns, index=x.index)
# Reattach the categorical columns
df = pd.concat([df.Categoricals, x_new], axis=1)
You might want to normalize some columns and leave the others unchanged, as in regression tasks where the data labels or categorical columns should not be changed. So I suggest this pythonic way (it's a combination of @shg's and @Cina's answers):
features_to_normalize = ['A', 'B', 'C']
# could be ['A','B']
df[features_to_normalize] = df[features_to_normalize].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
df_normalized = df / df.max(axis=0)
It is only simple mathematics. The answer should be as simple as below.
normed_df = (df - df.min()) / (df.max() - df.min())
This is how you do it column-wise using a list comprehension:
[df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min())) for col in df.columns]
You can simply use the pandas.DataFrame.transform function in this way:
df.transform(lambda x: x/x.max())
def normalize(x):
    # Scale a column so that its absolute values sum to 1 (L1 norm)
    return x / np.linalg.norm(x, ord=1)
data = data.apply(normalize)
According to the pandas documentation, a DataFrame can apply an operation (function) to itself (signature from an older pandas release; broadcast and reduce have since been removed):
DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)
Applies function along input axis of DataFrame. Objects passed to functions are Series objects having index either the DataFrame's index (axis=0) or the columns (axis=1). Return type depends on whether passed function aggregates, or the reduce argument if the DataFrame is empty.
You can apply a custom function to operate on the DataFrame.
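As a small sketch of what such a custom function does when it divides by the L1 norm, the example below uses made-up data and a hypothetical helper name, `l1_normalize`. After the call, the absolute values in each column sum to 1. This is a different goal from min-max scaling, and negative inputs stay negative.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration.
data = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [-4.0, 4.0, 8.0]})

def l1_normalize(col):
    # For a 1-D input, the L1 norm is the sum of absolute values.
    return col / np.linalg.norm(col, ord=1)

# apply() passes each column to the function as a Series (axis=0 is the default).
normalized = data.apply(l1_normalize)

print(normalized)
print(normalized.abs().sum())
```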
The following function calculates the Z score:
def standardization(dataset):
    """
    Standardize numeric fields so that each has a mean of zero and a
    standard deviation of one (z-score).
    Args:
        dataset: A `pandas.DataFrame`
    """
    # Standardize float columns (float32 and float64).
    for column in dataset.columns:
        if pd.api.types.is_float_dtype(dataset[column]):
            dataset[column] -= dataset[column].mean()
            dataset[column] /= dataset[column].std()
    return dataset
You can use minmax_scale to transform each column to a scale from 0-1.
Normalize all columns:
from sklearn.preprocessing import minmax_scale
df[:] = minmax_scale(df)
Normalize a single column:
from sklearn.preprocessing import minmax_scale
df['a'] = minmax_scale(df['a'])
Normalize only numerical columns:
import numpy as np
from sklearn.preprocessing import minmax_scale
cols = df.select_dtypes(np.number).columns
df[cols] = minmax_scale(df[cols])
Full example:
# Prep
import pandas as pd
import numpy as np
from sklearn.preprocessing import minmax_scale
# Sample data
df = pd.DataFrame({'a':[0,1,2], 'b':[-10,-30,-50], 'c':['x', 'y', 'z']})
# MinMax normalize all numeric columns
cols = df.select_dtypes(np.number).columns
df[cols] = minmax_scale(df[cols])
# Result
print(df)
#      a    b  c
# 0  0.0  1.0  x
# 1  0.5  0.5  y
# 2  1.0  0.0  z
Note: The index, column names, and non-numerical variables are kept unchanged. The function is applied to each column.
More info on standardization and normalization:
You can do this in one line:
DF_test = DF_test.sub(DF_test.mean(axis=0), axis=1)/DF_test.mean(axis=0)
It takes the mean of each column, subtracts it from every value in that column, and then divides by that same mean. The result is a mean-centered, mean-scaled data set; the values are not guaranteed to lie between 0 and 1.
Pandas does column-wise normalization by default. Try the code below.
X = pd.read_csv('.\\data.csv')
X = (X-X.min())/(X.max()-X.min())
The output values will be in the range 0 to 1.
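One edge case to keep in mind with this formula is a column whose values are all equal: the range is zero, so the division yields NaN. The sketch below, on made-up data, shows one possible guard, replacing a zero range with 1 so that such a column becomes all zeros. That choice is an assumption for illustration, not the only reasonable one.

```python
import pandas as pd

# Made-up data: one ordinary column and one constant column.
X = pd.DataFrame({"a": [1.0, 2.0, 3.0], "constant": [5.0, 5.0, 5.0]})

value_range = X.max() - X.min()

# Naive min-max scaling: the constant column becomes NaN (0 divided by 0).
naive = (X - X.min()) / value_range

# Guarded version: a zero range is replaced by 1, so the column maps to 0.
safe = (X - X.min()) / value_range.replace(0, 1)

print(naive)
print(safe)
```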
You can use the apply function with a lambda:
def normalize(df_col):
    # Leave the 'ID' and 'Class' columns untouched
    if df_col.name not in ('ID', 'Class'):
        max_value = df_col.max()
        min_value = df_col.min()
        # A constant column would give NaN (0/0); return 0 instead
        if max_value == min_value:
            return 0
        return (df_col - min_value) / (max_value - min_value)
    return df_col
df_normalize = df.apply(lambda x: normalize(x))
If your data is positively skewed, a common way to normalize it is to use a log transformation:
df = np.log10(df)
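A log transform on its own compresses large values but does not map them into [0, 1], and `np.log10` is not finite at zero. The sketch below, on made-up data, shifts the values by 1 before taking the logarithm and then applies min-max scaling. Both the shift and the follow-up scaling are assumptions for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Made-up, positively skewed data that includes a zero.
df = pd.DataFrame({"income": [0.0, 9.0, 99.0, 999.0]})

# Shift by 1 so that zero maps to log10(1) = 0 instead of -inf.
logged = np.log10(df + 1)

# Min-max scale the logged values into [0, 1].
scaled = (logged - logged.min()) / (logged.max() - logged.min())

print(logged)
print(scaled)
```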