Pandas DataFrame sort by categorical column but by specific class ordering
Andrew Henderson
I would like to select the top entries in a Pandas DataFrame based on the values of a specific column, using df_selected = df_targets.head(N).
Each entry has a target value (in order of importance):
Likely Supporter, GOTV, Persuasion, Persuasion+GOTV
Unfortunately, if I do
df_targets = df_targets.sort("target")
the ordering will be alphabetical (GOTV, Likely Supporter, ...).
I was hoping for a keyword like list_ordering as in:
my_list = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]
df_targets = df_targets.sort("target", list_ordering=my_list)
To deal with this issue I create a dictionary:
dict_targets = OrderedDict()
dict_targets["Likely Supporter"] = "0 Likely Supporter"
dict_targets["GOTV"] = "1 GOTV"
dict_targets["Persuasion"] = "2 Persuasion"
dict_targets["Persuasion+GOTV"] = "3 Persuasion+GOTV"
but it seems like a non-Pythonic approach.
Suggestions would be much appreciated!
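For reference, the dictionary idea above can be reduced to a label-to-rank mapping that is passed to the sort as a key. This is only a minimal sketch: the sample data and N are made up for illustration, and it assumes pandas 1.1 or later, where sort_values accepts a key argument.

```python
import pandas as pd

# Hypothetical sample data; only the "target" column name comes from the question.
df_targets = pd.DataFrame({
    "target": ["GOTV", "Persuasion", "Likely Supporter",
               "GOTV", "Persuasion", "Persuasion+GOTV"],
})

my_list = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]

# Map each label to its position in the desired ordering.
rank = {label: position for position, label in enumerate(my_list)}

# Sort by the numeric rank instead of the label itself.
# mergesort is stable, so rows with the same label keep their original order.
df_sorted = df_targets.sort_values(
    "target", key=lambda col: col.map(rank), kind="mergesort"
)

N = 3  # hypothetical number of rows to keep
df_selected = df_sorted.head(N)
print(df_selected)
```

The labels themselves stay untouched, so there is no need to prefix them with "0 ", "1 " and so on.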
4 Answers
I think you need Categorical with the parameter ordered=True; sorting with sort_values then works very well:
Check the documentation for Categorical:
Ordered Categoricals can be sorted according to the custom order of the categories and can have a min and max value.
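To illustrate the min/max part of that quote, here is a small sketch. The three sample values are made up; the category list is the one from the question.

```python
import pandas as pd

order = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]
s = pd.Series(
    pd.Categorical(["GOTV", "Persuasion", "Likely Supporter"],
                   categories=order, ordered=True)
)

# min/max follow the category order, not the alphabet.
print(s.min())  # Likely Supporter
print(s.max())  # Persuasion

# Comparisons against a category label also use the custom order.
print((s < "Persuasion").tolist())  # [True, False, True]
```

Without ordered=True, min, max and the < comparison would raise instead of returning a value.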
import pandas as pd
df = pd.DataFrame({'a': ['GOTV', 'Persuasion', 'Likely Supporter', 'GOTV', 'Persuasion', 'Persuasion+GOTV']})
df.a = pd.Categorical(df.a, categories=["Likely Supporter","GOTV","Persuasion","Persuasion+GOTV"], ordered=True)
print (df)
                  a
0              GOTV
1        Persuasion
2  Likely Supporter
3              GOTV
4        Persuasion
5   Persuasion+GOTV
print (df.a)
0                GOTV
1          Persuasion
2    Likely Supporter
3                GOTV
4          Persuasion
5     Persuasion+GOTV
Name: a, dtype: category
Categories (4, object): [Likely Supporter < GOTV < Persuasion < Persuasion+GOTV]
df.sort_values('a', inplace=True)
print (df)
                  a
2  Likely Supporter
0              GOTV
3              GOTV
1        Persuasion
4        Persuasion
5   Persuasion+GOTV
The method shown in my previous answer is now deprecated.
Instead, it is best to use pandas.Categorical as shown here.
So:
list_ordering = ["Likely Supporter","GOTV","Persuasion","Persuasion+GOTV"]
df["target"] = pd.Categorical(df["target"], categories=list_ordering)
I guess this approach is sufficient, and preferable in certain situations. This is your preferred ordering...
my_order = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]
So, just do...
df['Column_to_update'] = df['Column_to_update'].cat.reorder_categories(my_order)
It is flexible and there is no need to build a new Categorical (the result is assigned back because the inplace argument of reorder_categories was removed in pandas 2.0). But your column must already have dtype 'category', otherwise it will not work.
Read more in the Pandas documentation.
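A rough end-to-end sketch of the dtype requirement mentioned above: the sample data is hypothetical, the column is converted to 'category' first, and the reordered result is assigned back rather than modified in place.

```python
import pandas as pd

# Hypothetical data; the column name mirrors the answer above.
df = pd.DataFrame({"Column_to_update": ["GOTV", "Persuasion+GOTV",
                                        "Likely Supporter", "Persuasion"]})
my_order = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]

# A plain object/string column has no .cat accessor, so convert it first.
df["Column_to_update"] = df["Column_to_update"].astype("category")

# reorder_categories returns a new Series; my_order must contain exactly
# the existing categories, otherwise pandas raises a ValueError.
df["Column_to_update"] = df["Column_to_update"].cat.reorder_categories(
    my_order, ordered=True
)

df_sorted = df.sort_values("Column_to_update")
print(df_sorted["Column_to_update"].tolist())
```

Note that reorder_categories only rearranges categories that are already present. If my_order contained a label that never occurs in the column, set_categories would be the method to look at instead.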
Thanks to jerzrael's input and references,
I like this sliced solution:
list_ordering = ["Likely Supporter","GOTV","Persuasion","Persuasion+GOTV"]
df["target"] = df["target"].astype("category", categories=list_ordering, ordered=True)
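As far as I know, recent pandas versions no longer accept the categories and ordered keywords in astype. A one-line equivalent that should still work is to pass a CategoricalDtype. This is a sketch with made-up sample data, and the head(3) at the end is just an illustrative N.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical sample data using the labels from the question.
df = pd.DataFrame({
    "target": ["GOTV", "Persuasion", "Likely Supporter",
               "GOTV", "Persuasion", "Persuasion+GOTV"],
})

list_ordering = ["Likely Supporter", "GOTV", "Persuasion", "Persuasion+GOTV"]

# Bundle the categories and the ordered flag into a dtype object.
target_dtype = CategoricalDtype(categories=list_ordering, ordered=True)
df["target"] = df["target"].astype(target_dtype)

# Sorting now follows list_ordering, so head() picks the most important rows.
df_selected = df.sort_values("target").head(3)
print(df_selected)
```

Any value that is not in list_ordering becomes NaN after the conversion, so it is worth checking df["target"].isna().sum() on real data.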