Some features are categorical: gender, for example, can be categorized as 'male' or 'female'. In model training, we have to convert these features to numerical values. Consider the following example with 3 features:

- gender: ['female', 'male']
- region: ['Africa', 'Asia', 'Europe', 'US']
- class: ['A', 'B', 'C']

After mapping each category to an integer, four samples might look like this:

Sample | Feature_1 | Feature_2 | Feature_3 |
---|---|---|---|
Sample_1 | 0 | 3 | 2 |
Sample_2 | 1 | 2 | 1 |
Sample_3 | 0 | 1 | 1 |
Sample_4 | 1 | 0 | 0 |
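
One way to arrive at such a table is to map every category string to an integer code. The sketch below is illustrative only: the dictionaries are hypothetical mappings, chosen so that the string samples used in the pandas example further down reproduce the numbers above. The article does not state which mapping was actually used.

```python
# Hypothetical category-to-integer mappings (assumptions, picked to match the table above).
gender_codes = {"male": 0, "female": 1}
region_codes = {"US": 0, "Europe": 1, "Asia": 2, "Africa": 3}
class_codes = {"A": 0, "B": 1, "C": 2}

# The four string samples from the pandas example later in this article.
raw_samples = [
    ("male", "Africa", "C"),
    ("female", "Asia", "B"),
    ("male", "Europe", "B"),
    ("female", "US", "A"),
]

# Replace every category string with its integer code.
encoded = [
    [gender_codes[g], region_codes[r], class_codes[c]]
    for g, r, c in raw_samples
]
print(encoded)
```

Any other consistent mapping would work just as well; the integers only serve as category labels and carry no numeric meaning.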

Take Feature_1: perhaps it stands for gender, represented as 0 for male and 1 for female. But how can we convert these numbers into a one-hot encoding? Each feature gets one bit per category, and exactly one bit is set for each value:

Feature | Encoding |
---|---|
Feature_1 | 0->01, 1->10 |
Feature_2 | 0->0001, 1->0010, 2->0100, 3->1000 |
Feature_3 | 0->001, 1->010, 2->100 |

Applying these encodings to the four samples gives:

Sample | Feature_1 | Feature_2 | Feature_3 |
---|---|---|---|
Sample_1 | 01 | 1000 | 100 |
Sample_2 | 10 | 0100 | 010 |
Sample_3 | 01 | 0010 | 010 |
Sample_4 | 10 | 0001 | 001 |

Concatenating the codes of each sample yields its one-hot feature vector:

Sample | Vector |
---|---|
Sample_1 | [0,1,1,0,0,0,1,0,0] |
Sample_2 | [1,0,0,1,0,0,0,1,0] |
Sample_3 | [0,1,0,0,1,0,0,1,0] |
Sample_4 | [1,0,0,0,0,1,0,0,1] |
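
The two steps above (encode each feature, then concatenate) can be sketched in plain Python. The helper names `one_hot` and `encode` are made up for this illustration, and the bit order follows the encoding table above, where value 0 sets the rightmost bit.

```python
# Number of categories per feature, taken from the example above.
n_categories = [2, 4, 3]

# Integer-coded samples from the first table.
samples = [
    [0, 3, 2],
    [1, 2, 1],
    [0, 1, 1],
    [1, 0, 0],
]


def one_hot(value, size):
    """Return `size` bits with exactly one bit set (value 0 -> rightmost bit)."""
    bits = [0] * size
    bits[size - 1 - value] = 1
    return bits


def encode(sample):
    """Concatenate the one-hot codes of all features of one sample."""
    vector = []
    for value, size in zip(sample, n_categories):
        vector.extend(one_hot(value, size))
    return vector


vectors = [encode(sample) for sample in samples]
for vector in vectors:
    print(vector)
```

Each resulting vector has 2 + 4 + 3 = 9 entries, one per category across all features.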

pandas.get_dummies is used to convert categorical variables into a one-hot encoding. Its signature is:

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)

import pandas as pd
df = pd.DataFrame([
['male', 'Africa', 'C'],
['female', 'Asia', 'B'],
['male', 'Europe', 'B'],
['female', 'US', 'A']
])
df.columns = ['gender', 'region', 'class']
pd.get_dummies(df, dtype=int)  # dtype=int gives 0/1 columns; pandas 2.0+ returns booleans by default

The input DataFrame holds the four samples:

Sample | gender | region | class |
---|---|---|---|
Sample_1 | male | Africa | C |
Sample_2 | female | Asia | B |
Sample_3 | male | Europe | B |
Sample_4 | female | US | A |

After encoding, every category becomes its own column. pandas orders the categories of each feature alphabetically, so gender_female comes before gender_male:

Sample | gender_female | gender_male | region_Africa | region_Asia | region_Europe | region_US | class_A | class_B | class_C |
---|---|---|---|---|---|---|---|---|---|
Sample_1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
Sample_2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
Sample_3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
Sample_4 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
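
Some parameters in the get_dummies signature are often useful in practice. As a small sketch, assuming the same DataFrame as before: `columns` restricts the encoding to selected columns, `drop_first` drops the first category of each encoded feature (it is implied by the remaining ones), and `dtype=int` forces 0/1 output.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["male", "Africa", "C"],
        ["female", "Asia", "B"],
        ["male", "Europe", "B"],
        ["female", "US", "A"],
    ],
    columns=["gender", "region", "class"],
)

# Encode only the gender column; region and class are left untouched.
# drop_first=True removes gender_female, the first category in sorted order.
encoded = pd.get_dummies(df, columns=["gender"], drop_first=True, dtype=int)
print(encoded)
```

With two categories, the single remaining gender_male column carries the same information as the full pair, which can help avoid redundant columns in linear models.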
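
sklearn.preprocessing.OneHotEncoder is used to encode categorical features as a one-hot numeric array. Its signature, as documented for older scikit-learn releases (since version 1.2 the sparse parameter is named sparse_output), is:

class sklearn.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')
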
from sklearn import preprocessing
enc = preprocessing.OneHotEncoder()
enc.fit([[0, 3, 2], [1, 2, 1], [0, 1, 1], [1, 0, 0]])
array = enc.transform([[0, 3, 2]]).toarray()
print(array)
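[[1. 0. 0. 0. 0. 1. 0. 0. 1.]]

Note that OneHotEncoder sets the bit at the index of the category in sorted order (0 -> [1, 0]), so this vector differs from the hand-made encoding of Sample_1 above, where 0 was written as 01. Both are valid one-hot encodings; only the bit order within each feature differs.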
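
OneHotEncoder can also be fitted on the category strings directly. The following is a minimal sketch, assuming the four string samples from the pandas example; the unseen region "Oceania" is a made-up value used only to show what `handle_unknown="ignore"` does.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rows = np.array(
    [
        ["male", "Africa", "C"],
        ["female", "Asia", "B"],
        ["male", "Europe", "B"],
        ["female", "US", "A"],
    ],
    dtype=object,
)

# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(rows)

# Categories learned for each feature, in sorted order.
categories = [list(c) for c in enc.categories_]
print(categories)

# A sample made of known categories.
known = enc.transform(np.array([["male", "Africa", "C"]], dtype=object)).toarray()
print(known)

# "Oceania" was not seen during fit, so all four region bits stay 0.
unknown = enc.transform(np.array([["male", "Oceania", "C"]], dtype=object)).toarray()
print(unknown)

# inverse_transform maps a one-hot row back to the original categories.
decoded = enc.inverse_transform(known)
print(decoded)
```

Calling `.toarray()` on the result works whether the sparse-output parameter is named `sparse` or `sparse_output` in the installed release, because the encoder returns a sparse matrix by default in both cases.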