RecBole appears to be a joint project started by laboratories at Renmin University of China and Peking University; the paper appeared on arXiv in November 2020. The library reached v1.0 in August 2021, and it seems to be seeing serious use.
The most attractive feature of RecBole is that it implements a large number of recommendation models behind a unified interface, which makes comparison easy. The coverage is impressive: more than 70 models (model list) and more than 20 datasets (dataset list) can be tried immediately:
pip install recbole
python run_recbole.py --model=<your favorite model> --dataset=ml-100k
That's all. You can instantly try over 70 models (some require additional configuration) against MovieLens-100k, the most famous benchmark in the recommendation community. There are not many environments where you can try this many models and datasets. All of the 70+ models have been carefully reimplemented in PyTorch and are very reliable, and the basic interfaces, such as the predict function, are standardized, which makes experimentation easy.
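For instance, once any model has been trained and saved, it can be reloaded and queried through the same entry point. A minimal sketch (the checkpoint filename here is made up):

from recbole.quick_start.quick_start import load_data_and_model

# Restore config, model, dataset and dataloaders from one checkpoint file
# (the path below is a hypothetical example).
config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
    model_file="saved/BPR-Nov-06-2021_02-41-35.pth"
)
model.eval()  # all models share the same scoring methods, e.g. predict()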
Training can also be launched from Python:
from recbole.quick_start import run_recbole
run_recbole(model=model_name, dataset="movielens-1m")
To control preprocessing, splitting, and evaluation, prepare a config file such as the following:
# general
gpu_id: 0
use_gpu: True
seed: 2020
state: INFO
reproducibility: True
data_path: 'dataset/'
checkpoint_dir: 'saved/movielens-1m'
show_progress: True
save_dataset: False
save_dataloaders: False
# Atomic File Format
field_separator: "\t"
seq_separator: "@"
# Common Features
USER_ID_FIELD: user_id
ITEM_ID_FIELD: item_id
RATING_FIELD: rating
TIME_FIELD: timestamp
seq_len: ~
# Label for Point-wise DataLoader
LABEL_FIELD: label
# NegSample Prefix for Pair-wise DataLoader
NEG_PREFIX: neg_
# Sequential Model Needed
ITEM_LIST_LENGTH_FIELD: item_length
LIST_SUFFIX: _list
MAX_ITEM_LIST_LENGTH: 50
POSITION_FIELD: position_id
# Knowledge-based Model Needed
HEAD_ENTITY_ID_FIELD: head_id
TAIL_ENTITY_ID_FIELD: tail_id
RELATION_ID_FIELD: relation_id
ENTITY_ID_FIELD: entity_id
# Selectively Loading
load_col:
    inter: [user_id, item_id, timestamp, rating]
    user: [user_id, age, gender, occupation, zip_code]
    item: [item_id, movie_title, release_year, genre]
unused_col:
    inter: [timestamp, rating]
# Filtering
rm_dup_inter: ~
val_interval: ~
filter_inter_by_user_or_item: True
user_inter_num_interval: "[1,inf]"
item_inter_num_interval: "[1,inf]"
# Preprocessing
alias_of_user_id: ~
alias_of_item_id: ~
alias_of_entity_id: ~
alias_of_relation_id: ~
preload_weight: ~
normalize_field: ~
normalize_all: True
# Training and evaluation config
epochs: 50
stopping_step: 10
train_batch_size: 4096
eval_batch_size: 4096
neg_sampling:
    uniform: 1
eval_args:
    group_by: user
    order: TO
    split: {'RS': [0.8,0.1,0.1]}
    mode: full
metrics: ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk: 10
valid_metric: MRR@10
metric_decimal_place: 4
Save this as config/movielens-1m.yml (or something similar) and run:
run_recbole(model=model_name, dataset="movielens-1m", config_file_list=["config/movielens-1m.yml"])
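The bundled run_recbole.py script should pick up the same file through its --config_files option (I am going from memory here, so treat the flag name as an assumption to verify):

python run_recbole.py --model=BPR --dataset=movielens-1m --config_files config/movielens-1m.yml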
A few notes on this config:
- seq_separator: if left at the default, sequence fields are separated by whitespace. I think that is fine for model training, but it makes it impossible to correctly pull item names when analyzing the top-k output later, so I put in a character that is unlikely to appear in the data ("@"). (I don't think this is the right way to do it, though, since you could just work around it some other way when pulling the item names.)
- ITEM_ID_FIELD: set this because the column name of the item id differs per dataset.
- MAX_ITEM_LIST_LENGTH, user_inter_num_interval and item_inter_num_interval are all settings that cut down the data. If you can't run a model because there is too much data, adjust these to lighten the load (though sometimes there is no good way around it); a sketch follows this list.
- eval_args: set order to TO and group_by to user! This groups the interactions by user and arranges them as time series. After that, the split setting takes effect, which here creates a train:valid:test split of 80%:10%:10%. I think this is the most realistic way to split the data.
With the config above, the results on MovieLens-1m come out as follows (sorted by ndcg@10):
Name | recall@10 | precision@10 | ndcg@10 | mrr@10 | hit@10 |
---|---|---|---|---|---|
NGCF | 0.0581 | 0.0647 | 0.0813 | 0.1616 | 0.3745 |
LightGCN | 0.0578 | 0.0644 | 0.081 | 0.1594 | 0.3671 |
DGCF | 0.0587 | 0.0633 | 0.0802 | 0.1585 | 0.3608 |
SLIMElastic | 0.0631 | 0.0612 | 0.0801 | 0.1546 | 0.3664 |
BPR | 0.0572 | 0.0638 | 0.0798 | 0.1566 | 0.3618 |
AutoInt | 0.0552 | 0.0635 | 0.0797 | 0.1588 | 0.3591 |
GCMC | 0.0567 | 0.0631 | 0.0793 | 0.1571 | 0.3596 |
AFM | 0.0535 | 0.0637 | 0.0789 | 0.1592 | 0.3652 |
NNCF | 0.0554 | 0.0626 | 0.0788 | 0.1583 | 0.3609 |
NAIS | 0.0592 | 0.0608 | 0.0782 | 0.1528 | 0.3589 |
EASE | 0.0658 | 0.0583 | 0.0779 | 0.1473 | 0.3598 |
DeepFM | 0.054 | 0.0621 | 0.0779 | 0.1579 | 0.3566 |
DCN | 0.0548 | 0.0618 | 0.0775 | 0.1538 | 0.3505 |
WideDeep | 0.0534 | 0.062 | 0.0774 | 0.1564 | 0.356 |
Item2Vec | 0.0591 | 0.0609 | 0.0773 | 0.1477 | 0.3598 |
FM | 0.0553 | 0.0611 | 0.0773 | 0.1557 | 0.3611 |
SpectralCF | 0.0527 | 0.0608 | 0.0768 | 0.1575 | 0.3531 |
RecVAE | 0.0563 | 0.0602 | 0.0765 | 0.1499 | 0.347 |
NeuMF | 0.055 | 0.0606 | 0.0763 | 0.152 | 0.3543 |
FFM | 0.0551 | 0.0612 | 0.076 | 0.1507 | 0.3614 |
xDeepFM | 0.0532 | 0.0599 | 0.0754 | 0.1545 | 0.3551 |
NFM | 0.0515 | 0.0605 | 0.075 | 0.153 | 0.3543 |
DMF | 0.0575 | 0.0582 | 0.0748 | 0.1455 | 0.345 |
PNN | 0.0524 | 0.06 | 0.0745 | 0.1509 | 0.3518 |
ItemKNN | 0.0558 | 0.0549 | 0.0716 | 0.1376 | 0.3243 |
FNN | 0.0476 | 0.058 | 0.0711 | 0.1434 | 0.3296 |
MultiDAE | 0.0513 | 0.0566 | 0.0707 | 0.1403 | 0.3336 |
MacridVAE | 0.0493 | 0.0536 | 0.0666 | 0.1341 | 0.3321 |
CDAE | 0.0384 | 0.0532 | 0.0632 | 0.1293 | 0.2965 |
FwFM | 0.0386 | 0.0532 | 0.063 | 0.1262 | 0.2921 |
LR | 0.0381 | 0.0534 | 0.0628 | 0.1271 | 0.2949 |
Pop | 0.0358 | 0.0494 | 0.0556 | 0.1095 | 0.2891 |
LINE | 0.0253 | 0.0485 | 0.054 | 0.1185 | 0.2609 |
DSSM | 0.0305 | 0.0411 | 0.0483 | 0.104 | 0.2627 |
ENMF | 0.0115 | 0.0176 | 0.0193 | 0.0461 | 0.1442 |
Next, I tried the Foursquare check-in dataset. Its description reads: "This dataset contains check-ins in NYC and Tokyo collected for about 10 months. Each check-in is associated with its timestamp, its GPS coordinates and its semantic meaning."
The number of users: 1084
Average actions of users: 84.04801477377654
The number of items: 38334
Average actions of items: 2.374559778780685
The number of inters: 91024
The sparsity of the dataset: 99.78095038424168%
This time only the interaction columns are loaded:
# Selectively Loading
load_col:
    inter: [user_id, venue_id, timestamp]
unused_col:
    inter: [timestamp]
The results, again sorted by ndcg@10:
Name | hit@10 | mrr@10 | ndcg@10 | precision@10 | recall@10 |
---|---|---|---|---|---|
LightGCN | 0.2004 | 0.1089 | 0.0401 | 0.0243 | 0.0323 |
SLIMElastic | 0.205 | 0.1071 | 0.0399 | 0.0246 | 0.0332 |
RecVAE | 0.1958 | 0.0979 | 0.0373 | 0.0236 | 0.0316 |
FNN | 0.1884 | 0.098 | 0.0367 | 0.0224 | 0.0303 |
DMF | 0.1911 | 0.0953 | 0.0364 | 0.023 | 0.0307 |
DeepFM | 0.1911 | 0.0931 | 0.0359 | 0.0229 | 0.0312 |
GCMC | 0.1819 | 0.0955 | 0.0359 | 0.0222 | 0.0297 |
MacridVAE | 0.1911 | 0.0934 | 0.0358 | 0.0228 | 0.0306 |
MultiVAE | 0.1745 | 0.0936 | 0.0349 | 0.0211 | 0.0282 |
NeuMF | 0.1791 | 0.0867 | 0.0343 | 0.0223 | 0.0299 |
MultiDAE | 0.1671 | 0.0928 | 0.0341 | 0.0203 | 0.0281 |
xDeepFM | 0.1662 | 0.0948 | 0.034 | 0.0198 | 0.026 |
FwFM | 0.169 | 0.0931 | 0.0335 | 0.0202 | 0.0264 |
WideDeep | 0.1616 | 0.0937 | 0.0335 | 0.0192 | 0.0259 |
LR | 0.1801 | 0.086 | 0.0329 | 0.0219 | 0.0292 |
AutoInt | 0.1644 | 0.0914 | 0.0323 | 0.0189 | 0.0253 |
SpectralCF | 0.1653 | 0.0888 | 0.0322 | 0.0194 | 0.026 |
PNN | 0.1717 | 0.0827 | 0.0319 | 0.0209 | 0.0283 |
AFM | 0.1717 | 0.0802 | 0.0315 | 0.0208 | 0.0281 |
DCN | 0.169 | 0.0839 | 0.0315 | 0.0207 | 0.0271 |
FFM | 0.1634 | 0.0829 | 0.0308 | 0.0191 | 0.0257 |
Pop | 0.1791 | 0.0711 | 0.0301 | 0.0214 | 0.0289 |
FM | 0.1468 | 0.0594 | 0.0243 | 0.0172 | 0.0244 |
BPR | 0.1505 | 0.0557 | 0.0237 | 0.0173 | 0.0238 |
NNCF | 0.1274 | 0.063 | 0.0232 | 0.0144 | 0.0204 |
NFM | 0.0822 | 0.022 | 0.0098 | 0.0084 | 0.0106 |
LINE | 0.0683 | 0.0225 | 0.0094 | 0.0074 | 0.0098 |
NGCF | 0.0572 | 0.023 | 0.0093 | 0.0059 | 0.0086 |
ItemKNN | 0.0406 | 0.0178 | 0.0062 | 0.0043 | 0.0054 |
Item2Vec | 0.0194 | 0.007 | 0.0026 | 0.0021 | 0.0024 |
DSSM | 0.0037 | 0.0015 | 0.0007 | 0.0004 | 0.0007 |
ENMF | 0.0028 | 0.0013 | 0.0005 | 0.0003 | 0.0005 |
CDAE | 0.0009 | 0.0003 | 0.0001 | 0.0001 | 0.0001 |
To inspect the actual recommendations, I wrote a script that dumps each test user's ground-truth items and top-10 predictions to JSON:
import json

import click
import pandas as pd
import torch
from recbole.data import create_dataset, data_preparation
from recbole.quick_start.quick_start import load_data_and_model
from recbole.utils.case_study import full_sort_topk
from tqdm.auto import tqdm

from src.custom_models.Item2Vec import Item2Vec


@click.command()
@click.option(
    "--model_file",
    required=True,
    type=str,
    help="example: saved/ckpd_recipe/Item2Vec-Nov-06-2021_02-41-35.pth",
)
@click.option(
    "--output_file",
    required=True,
    type=str,
    help="example: pop.json",
)
@click.option("--is_item2vec", type=bool, is_flag=True)
def main(model_file, output_file, is_item2vec):
    print("=====")
    print(model_file)
    print("=====")

    # To resolve item titles later (e.g. for foursquare).
    _df = pd.read_csv(
        "dataset/foursquare-nyc-merged/foursquare-nyc-merged.item", sep="\t"
    )
    internal_id_to_title = _df["venue_category_name:token"].to_dict()

    if is_item2vec:
        # Custom model (Item2Vec): restore it by hand from the checkpoint.
        checkpoint = torch.load(model_file)
        config = checkpoint["config"]
        config.seq_separator = "@"
        dataset = create_dataset(config)
        train_data, valid_data, test_data = data_preparation(config, dataset)
        model = Item2Vec(config, train_data.dataset).to(config["device"])
        model.load_state_dict(checkpoint["state_dict"])
        model.load_other_parameter(checkpoint.get("other_parameter"))
    else:
        # Built-in models can be restored in one call.
        config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
            model_file=model_file
        )

    # Collect each test user's ground-truth (positive) items.
    ground_list = []
    uid_list = []
    for batch_idx, batched_data in enumerate(test_data):
        interaction, row_idx, positive_u, positive_i = batched_data
        ground_list.append([int(v) for v in positive_i.numpy().tolist()])
        uid_list.append(interaction.user_id.numpy()[0])

    # Top-10 recommendations per user.
    ranked_list = []
    for uid in tqdm(uid_list):
        topk_score, topk_iid_list = full_sort_topk(
            [uid], model, test_data, k=10, device="cuda"
        )
        ranked_list += topk_iid_list.cpu()

    all_metrics_results = {}
    for uid, g_list, r_list in zip(uid_list, ground_list, ranked_list):
        external_uid = dataset.id2token(dataset.uid_field, uid)
        all_metrics_results[external_uid] = {
            "ground_list_id": [v for v in dataset.id2token(dataset.iid_field, g_list)],
            "predict_list_id": [v for v in dataset.id2token(dataset.iid_field, r_list)],
            "ground_list": [internal_id_to_title[v - 1] for v in g_list],
            "predict_list": [internal_id_to_title[v - 1] for v in r_list.numpy()],
        }

    text = json.dumps(all_metrics_results, sort_keys=True, ensure_ascii=False, indent=2)
    with open(output_file, "w") as fh:
        fh.write(text)


if __name__ == "__main__":
    main()
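It is invoked like this (the script name case_study.py is my placeholder; the checkpoint path follows the pattern shown in the --model_file help string):

python case_study.py \
    --model_file saved/ckpd_recipe/Item2Vec-Nov-06-2021_02-41-35.pth \
    --output_file item2vec.json \
    --is_item2vec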
The full_sort_topk function from recbole.utils.case_study returns the top-k items, so I used it in this script. (The fact that it lives in recbole.utils.case_study suggests that top-k output is not the main purpose of RecBole.)
After dumping one JSON file per model into output/foursquare_nyc_case_study/, I measured how much the models' top-10 lists overlap with each other:
import glob
import itertools

import numpy as np
import pandas as pd
from tqdm.auto import tqdm


def main():
    filelist = glob.glob("output/foursquare_nyc_case_study/*.json")

    # Load each model's dump; the file stem is the model name.
    model_results = {}
    for file in tqdm(filelist):
        _model = file.split("/")[-1].split(".")[0]
        try:
            _df = pd.read_json(file).T
            model_results[_model] = _df
        except Exception:
            print(f"failed to read {_model}")

    # For every pair of models, compute the mean per-user overlap of
    # their top-10 lists: |A ∩ B| / 10.
    _models = model_results.keys()
    combis = list(itertools.combinations(_models, 2))
    model_similarities = []
    for c in tqdm(combis):
        model1, model2 = c
        model1_result = model_results[model1]
        model2_result = model_results[model2]
        model1_predict_list = model1_result["predict_list_id"].values
        model2_predict_list = model2_result["predict_list_id"].values
        sims = []
        for m1_preds, m2_preds in zip(model1_predict_list, model2_predict_list):
            _sim = len(set(m1_preds) & set(m2_preds)) / len(m1_preds)
            sims.append(_sim)
        similarity = np.mean(sims)
        model_similarities.append([model1, model2, similarity])

    result = pd.DataFrame(
        model_similarities, columns=["source_model", "dest_model", "similarity"]
    )
    result.to_csv("foursquare_nyc_survey_with_recbole.csv", index=False)


if __name__ == "__main__":
    main()
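To eyeball the result, the pairwise rows can be pivoted into a matrix (a quick sketch over the CSV produced above):

import pandas as pd

df = pd.read_csv("foursquare_nyc_survey_with_recbole.csv")
# Rows/columns are models; each cell is the mean top-10 overlap of that pair.
matrix = df.pivot(index="source_model", columns="dest_model", values="similarity")
print(matrix.round(2))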