POSITION | NAME | SCORE | ENTRIES |
---|---|---|---|
20 | jefferson1100001 | 0.895456136076574 | 7 |
TITLE | LABEL_QUALITY | LANGUAGE | CATEGORY |
---|---|---|---|
Hidrolavadora Lavor One 120 Bar 1700w Bomba A... | unreliable | spanish | ELECTRIC_PRESSURE_WASHERS |
Placa De Sonido - Behringer Umc22 | unreliable | spanish | SOUND_CARDS |
Maquina De Lavar Electrolux 12 Kilos | unreliable | portuguese | WASHING_MACHINES |
Remove tildes
Spanish and Portuguese words carry accent marks ("tildes" in Spanish), as in teléfono. This step rewrites the word as telefono.
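A minimal sketch of this step using only the standard library. The function name `strip_accents` is hypothetical, and the original post does not say which approach it used; Unicode decomposition is one common way to do it.

```python
import unicodedata


def strip_accents(text: str) -> str:
    # NFD decomposition splits "é" into "e" plus a combining accent,
    # which has the Unicode category "Mn" (nonspacing mark).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_accents("teléfono"))
```

Note that this approach also maps ñ to n and ç to c, which may or may not be what you want for Spanish and Portuguese titles.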
Remove word separators
Some titles join words with dashes, dots, or other punctuation marks instead of spaces, for example kit.ruedas.moto.
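One way to handle this is to replace every run of separator characters with a single space. The sketch below assumes a particular set of separators (`. - _ / , ; : +`); the exact set used in the original solution is not stated, and `split_separators` is a hypothetical name.

```python
import re

# Assumed set of characters that act as word separators in titles.
SEPARATORS = re.compile(r"[.\-_/,;:+]+")


def split_separators(title: str) -> str:
    # Replace each run of separators with a space, then collapse whitespace.
    return " ".join(SEPARATORS.sub(" ", title).split())


print(split_separators("kit.ruedas.moto"))
```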
Remove other punctuation marks and numbers
I removed all remaining punctuation marks and numbers. A number is removed only when it is surrounded by word boundaries, so tokens such as umc22 or 4gb keep their digits.
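The word-boundary rule can be expressed with the regular expression `\b\d+\b`, which matches only digits that form a whole token. This is a sketch under that assumption; the function name is hypothetical and the original regular expressions are not shown in the post.

```python
import re


def remove_punct_and_numbers(title: str) -> str:
    # Replace anything that is neither a word character nor whitespace.
    no_punct = re.sub(r"[^\w\s]", " ", title)
    # Drop digits only when they stand alone as a token ("10"),
    # not when they are part of one ("i3", "4gb").
    no_numbers = re.sub(r"\b\d+\b", " ", no_punct)
    return " ".join(no_numbers.split())


print(remove_punct_and_numbers("Oportunidad! Notebook Dell I3 - 4gb Ddr4 - Hd 1tb - Win 10"))
```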
Tokenize the title
I applied the WordPunctTokenizer provided by NLTK to split each title into words.
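For reference, a small usage sketch of `WordPunctTokenizer`. It is regex-based, so it needs no extra NLTK data downloads. The sample title and the `tokenize` wrapper are illustrative only.

```python
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()


def tokenize(title: str) -> list:
    # Splits the text into runs of word characters and runs of punctuation.
    return tokenizer.tokenize(title)


tokens = tokenize("placa de sonido behringer umc22")
print(tokens)
```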
Remove stop words
From the resulting list of words, I discarded stop words such as "un", "unas", "unos"...
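A sketch of the filtering step with a tiny hand-written stop-word set. The set below is an assumption for illustration; a real list would be much longer (NLTK, for instance, ships Spanish and Portuguese lists in its `stopwords` corpus, which has to be downloaded first).

```python
# Illustrative subset only, not the list used in the original solution.
STOP_WORDS = frozenset({"un", "una", "unas", "unos", "de", "la", "el", "y"})


def remove_stop_words(tokens: list) -> list:
    # Keep the original order and drop any token found in the stop-word set.
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]


print(remove_stop_words(["placa", "de", "sonido", "behringer", "umc22"]))
```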
Stem each token
I used the SnowballStemmer provided by NLTK. Stemming is the process of reducing inflected (or sometimes derived) words to their word stem. For example: Cámara is transformed to "cam".
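A short usage sketch of the stemmer. It assumes the Spanish stemmer is applied to accent-free, lowercase tokens; how the original solution chose between the Spanish and Portuguese stemmers is not stated, and `stem_tokens` is a hypothetical helper.

```python
from nltk.stem import SnowballStemmer

# "portuguese" is also a supported language name.
stemmer = SnowballStemmer("spanish")


def stem_tokens(tokens: list) -> list:
    # Reduce each token to its stem; the list length is unchanged.
    return [stemmer.stem(tok) for tok in tokens]


stems = stem_tokens(["camara", "instantanea", "azul"])
print(stems)
```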
BEFORE | AFTER PREPROCESSING |
---|---|
Placa De Sonido - Behringer Umc22 | plac son behring umc22 |
Oportunidad! Notebook Dell I3 - 4gb Ddr4 - Hd 1tb - Win 10 | oportun notebook dell i3 4gb ddr4 hd 1tb win |
Cámara Instantánea Fujifilm Instax Mini 9 - Azul Cobalto | cam instantane fujifilm instax mini azul cobalt |
{'kit': 785233, 'original': 469537, 'pret': 232647, 'led': 220194, ...}
['porton', 'chap', 'hoj', 'mtr', 'marc']
[120, 121, 122, 123, 124]
[120, 121, 122, 123, 124, 0, 0, 0, 0, 0, 0, 0, ....]
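The outputs above show a token-frequency vocabulary, a tokenized title, its integer ids, and the zero-padded sequence. A minimal sketch of that encoding follows. It assumes ids are assigned by frequency rank starting at 1, that 0 is reserved for padding, and that unknown tokens are dropped; none of these details are confirmed by the post, and the ids it produces will not match the ones shown above.

```python
from collections import Counter

PAD_ID = 0  # assumed padding value, matching the trailing zeros above


def build_vocab(tokenized_titles):
    # Count every token, then give the most frequent token id 1, the next 2, ...
    counts = Counter(tok for title in tokenized_titles for tok in title)
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(), start=1)}


def encode(tokens, vocab, max_len):
    # Map tokens to ids (dropping unknown tokens), truncate, then right-pad.
    ids = [vocab[tok] for tok in tokens if tok in vocab][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


titles = [["porton", "chap", "hoj"], ["kit", "porton"], ["kit", "led", "porton"]]
vocab = build_vocab(titles)
padded = encode(["porton", "kit", "marc"], vocab, max_len=5)
print(vocab, padded)
```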
Seed the random number generators so you can get reproducible results.
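A sketch of a seeding helper. The name `seed_everything` and the seed value are arbitrary; NumPy is seeded only if it is installed. A deep learning framework has its own seed call as well (for example `tf.random.set_seed` in TensorFlow), which is left out here because the post does not name the framework.

```python
import random

SEED = 42  # arbitrary value; any fixed integer works


def seed_everything(seed: int = SEED) -> None:
    # Python's built-in generator.
    random.seed(seed)
    # NumPy's global generator, if NumPy is available.
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass


seed_everything()
```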
Use stratified samples when splitting test and train
It means that each set must have the same proportion of classes.
Take 1% for testing
The dataset is relatively big, so 1% still seems to cover enough of its variety to be useful for validation.
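To make the idea of a stratified 1% split concrete, here is a pure-Python sketch that samples the same fraction from every class. In practice scikit-learn's `train_test_split(..., stratify=labels)` does this for you; the helper below, its name, and the toy labels are only illustrative.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.01, seed=42):
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # local, seeded generator for reproducibility
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


labels = ["SOUND_CARDS"] * 300 + ["WASHING_MACHINES"] * 100
train_idx, test_idx = stratified_split(labels, test_fraction=0.01)
print(len(train_idx), len(test_idx))
```

With a fraction this small, very rare classes can round down to zero test samples, so it is worth checking the per-class counts after splitting.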
Use class_weights
Because the dataset is imbalanced, using class_weights increased the model's balanced accuracy (BACC).
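The post does not say how the weights were computed. One common choice is the "balanced" heuristic, where each class gets the weight n_samples / (n_classes * class_count), so rare classes weigh more. A sketch with a hypothetical helper and toy labels:

```python
from collections import Counter


def balanced_class_weights(labels):
    # weight(c) = n_samples / (n_classes * count(c))
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

A dictionary of this shape (class index to weight) is what Keras accepts through the `class_weight` argument of `Model.fit`, if that is the framework in use.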
Explore the data locally
You can also preprocess locally, but use multiprocessing to speed it up.
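A sketch of parallel preprocessing with the standard `multiprocessing.Pool`. The `preprocess` function here is a simplified stand-in (lowercasing, accent and punctuation removal), not the full pipeline from the post, and the worker count and chunk size are arbitrary.

```python
import re
import unicodedata
from multiprocessing import Pool


def preprocess(title: str) -> str:
    # Simplified stand-in: lowercase, strip accents, drop punctuation.
    decomposed = unicodedata.normalize("NFD", title.lower())
    no_accents = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return " ".join(re.sub(r"[^\w\s]", " ", no_accents).split())


def preprocess_all(titles, workers=1, chunksize=1000):
    # With a single worker, skip the process pool entirely.
    if workers <= 1:
        return [preprocess(t) for t in titles]
    # Otherwise spread the titles over worker processes in chunks.
    with Pool(processes=workers) as pool:
        return pool.map(preprocess, titles, chunksize=chunksize)
```

When using more than one worker, call `preprocess_all(titles, workers=4)` from a script under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.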
Use Colab or Kaggle
To take advantage of the GPU and train faster.
"Buzo Harry Potter Lentes Cicatriz Hogwarts Hoodie"
SWEATSHIRTS_AND_HOODIES