BayesianDataFusion.jl package
This gives reference and examples for BayesianDataFusion.jl.
Features
BayesianDataFusion.jl
provides parallel and highly optimized implementation for
- Bayesian Probabilistic Matrix Factorization (BPMF)
- Bayesian Probabilistic Tensor Factorization (BPTF)
- Macau - Bayesian Multi-relational Factorization with Side Information
These methods allow to predict unobserved values in the matrices (or tensors). Since they are all Bayesian methods we can also measure the uncertainty of the predictions. BPMF and BPTF are special cases of Macau. Macau adds
- use of entity side information to improve factorization (e.g, user and/or movie features for factorizing movie ratings)
- use of relation side information to improve factorization (e.g., data about when user went to see particular movie)
- factorization of several matrices (and tensors) for an entity simultaneously.
- Macau can handle high dimensional side-information, e.g., 1,000,000-dimensional user features.
Installation
Inside Julia:
Pkg.clone("https://github.com/jaak-s/BayesianDataFusion.jl.git")
Examples
Next we give simple examples of using Macau for movie ratings prediction from MovieLens data, which is included in the BayesianDataFusion package.
MovieLens
We will use macau
function to factorize (incompletely observed) matrix of movie ratings with side information for both users and movies. The side information contains basic features about users like age group and gender and genre information for movies. To run the example first install Julia library for reading matlab files
Pkg.add("MAT")
Example code
using BayesianDataFusion
using MAT
## load and setup data
pkgdir = Pkg.dir("BayesianDataFusion")
data = matread("$pkgdir/data/movielens_1m.mat")
## setup entities, assigning side information through optional argument F
users = Entity("users", F=data["Fu"]);
movies = Entity("movies", F=data["Fv"]);
## setup the relation between users and movies, data from sparse matrix data["X"]
## first element in '[users, movies]' corresponds to rows and second to columns of data["X"]
ratings = Relation(data["X"], "ratings", [users, movies], class_cut = 2.5);
## assign 500,000 of the observed ratings randomly to the test set
assignToTest!(ratings, 500_000)
## precision of the ratings to 1.5 (i.e., variance of 1/1.5)
setPrecision!(ratings, 1.5)
## the model (with only one relation)
RD = RelationData(ratings)
## run Gibbs sampler of Macau with 10 latent dimensions, total of 100 burnin and 400 posterior samples
result = macau(RD, burnin=100, psamples=400, clamp=[1.0, 5.0], num_latent=10)
This model has only a single relation ratings
between entities users
and movies
.
We use precision 1.5, which is known to be a good estimate of movie rating noise.
The optional parameter clamp=[1.0, 5.0]
to macau
thresholds the predictions to be between 1.0 and 5.0.
To build a model with larger latent dimension use, for example, num_latent=30
.
Macau output shows the progress of the Gibbs sampler:
1: Acc=0.836 ROC=0.500 RMSE=1.118 | use[U: 3.1 β:0.04 λ=21.] mov[U: 3.1 β:0.07 λ=10.] | rati[α=1.5] [4s]
2: Acc=0.836 ROC=0.500 RMSE=1.118 | use[U: 4.4 β:0.03 λ=60.] mov[U: 4.4 β:0.04 λ=38.] | rati[α=1.5] [0s]
...
80: Acc=0.864 ROC=0.829 RMSE=0.889 | use[U: 72.7 β:1.34 λ=4.5] mov[U:122.6 β:3.25 λ=3.1] | rati[α=1.5] [0s]
81: Acc=0.864 ROC=0.829 RMSE=0.888 | use[U: 73.0 β:1.39 λ=5.3] mov[U:123.0 β:3.32 λ=3.2] | rati[α=1.5] [0s]
...
The Acc/ROC/RMSE are computed on the test ratings. Note the optional argument class_cut = 2.5
, used for creating a Relation
, defines the class boundary for computing accuracy (Acc) and AUC-ROC (ROC) values.
An example result of the run is:
Dict{AbstractString,Any} with 10 entries:
"latent_multi_threading" => true
"psamples" => 400
"lambda_beta" => 14.506891192240246
"RMSE" => 0.8526296036293598
"train_counts" => 500000x2 DataFrames.DataFrame…
"predictions" => 500000x5 DataFrames.DataFrame…
"burnin" => 100
"num_latent" => 10
"accuracy" => 0.870428
"ROC" => 0.8485116969174835
where result["predictions"]
gives predicted values and their standard deviation for the values in the test set. The result
also contains ROC, RMSE and accuracy values for the test set.
MovieLens w/o side-information
The above example used user and move features. You can easily factorize the ratings without them, which would correspond to classic BPMF method. Here is an example code
using BayesianDataFusion
using MAT
## load and setup data
pkgdir = Pkg.dir("BayesianDataFusion")
data = matread("$pkgdir/data/movielens_1m.mat")
## setup entities, no features (F):
users = Entity("users");
movies = Entity("movies");
## setup the relation between users and movies, data from sparse matrix data["X"]
## first element in '[users, movies]' corresponds to rows and second to columns of data["X"]
ratings = Relation(data["X"], "ratings", [users, movies], class_cut = 2.5);
## assign 500,000 of the observed ratings randomly to the test set
assignToTest!(ratings, 500_000)
## precision of the ratings to 1.5 (i.e., variance of 1/1.5)
setPrecision!(ratings, 1.5)
## the model (with only one relation)
RD = RelationData(ratings)
## run Gibbs sampler of Macau with 10 latent dimensions, total of 100 burnin and 400 posterior samples
result = macau(RD, burnin=100, psamples=400, clamp=[1.0, 5.0], num_latent=10)
In most applications the performance of pure BPMF is weaker compared to Macau. This is also true in the case of MovieLens dataset.
Tensor factorization with side information
Here is an example of factorization of 3-tensor on a toy data of compound x cell_line x gene
where two of the modes have side information. We first generate the toy dataset:
using BayesianDataFusion
using DataFrames
## generating artificial data
Ncompounds = 100
Ncell_lines = 10
Ngenes = 50
A = randn(Ncompounds, 2);
B = randn(Ncell_lines, 2);
C = randn(Ngenes, 2);
X = Float64[ sum(A[i,:].*B[j,:].*C[k,:]) for i in 1:size(A,1), j in 1:size(B,1), k in 1:size(C,1)];
## adding artificial data into DataFrame
df = DataFrame(compound=Int64[], cell_line=Int64[], gene=Int64[], value=Float64[])
for i=1:size(A,1), j=1:size(B,1), k=1:size(C,1)
push!(df, Any[i, j, k, X[i,j,k]])
end
## generating side information
Fcompounds = A * randn(2, 5);
Fgenes = C * randn(2, 10);
The dataframe object has an id for each mode and the last column gives the value of the tensor.
head(df)
6×4 DataFrames.DataFrame
│ Row │ compound │ cell_line │ gene │ value │
├─────┼──────────┼───────────┼──────┼───────────┤
│ 1 │ 1 │ 1 │ 1 │ -0.112793 │
│ 2 │ 1 │ 1 │ 2 │ 0.555784 │
│ 3 │ 1 │ 1 │ 3 │ 0.116109 │
│ 4 │ 1 │ 1 │ 4 │ 0.106436 │
│ 5 │ 1 │ 1 │ 5 │ 0.661452 │
│ 6 │ 1 │ 1 │ 6 │ -0.483896 │
Next we set up the 3-tensor model and add side information to compound
and gene
data
## creating the three entities (compounds and genes have side information)
compound = Entity("compound", F = Fcompounds);
cell_line = Entity("cell_line");
gene = Entity("gene", F = Fgenes);
## creating Tensor relation
gene_expr = Relation(df, "GeneExpr", [compound, cell_line, gene], class_cut = 0.0)
## setting noise precision of the observations (1 / variance)
setPrecision!(gene_expr, 5.0)
## assign 5000 values to test set
assignToTest!(gene_expr, 5000)
## the model with one 3-tensor
RD = RelationData(gene_expr)
## perform factorization
result = macau(RD, burnin=100, psamples=900, num_latent=10)
The dataframe result["predictions"]
gives the predictions on the test set.
For this toy dataset num_latent=10
is sufficient, for larger datasets it should be increased.
The model we executed looks like this:
julia> RD
[Relations]
GeneExpr: compound--cell_line--gene, #known = 45000, #test = 5000, α = 5.00
[Entities]
compound: 100 with 5 features (λ = sample)
cell_line: 10 with no features
gene: 50 with 10 features (λ = sample)
Multi-relational models
Macau also provides factorization of models with multiple relations, where each entity can have also side information.
Here is an example where we have 3 entities: users
, movies
, books
with 2 relations movie_ratings
and book_ratings
.
For illustration we add side information on movies
(8-dimensional) and books
(12-dimensional).
using BayesianDataFusion
## entities
Nusers = 50
Nmovies = 25
Nbooks = 20
users = Entity("users")
movies = Entity("movies", F = randn(Nmovies, 8)) # side information
books = Entity("books", F = randn(Nbooks, 12)) # side information
## relations, using random data
movie_ratings = Relation(sprand(Nusers, Nmovies, 0.2), "mratings", [users, movies], class_cut = 0.5)
book_ratings = Relation(sprand(Nusers, Nbooks, 0.3), "bratings", [users, books], class_cut = 0.5)
## assign 20 movie ratings to test set
assignToTest!(movie_ratings, 20)
## set precision of the observation noise (inverse of variance)
setPrecision!(movie_ratings, 1.5)
setPrecision!(book_ratings, 2.0)
## multi-relational model
RD = RelationData()
addRelation!(RD, movie_ratings)
addRelation!(RD, book_ratings)
## run Macau with 10 latent dimensions, total of 100 burnin and 400 posterior samples
result = macau(RD, burnin=100, psamples=400, num_latent=10)
Note that Macau reports RMSE on the test data of the first relation, which in this example was movie_ratings
.
To see the model structure you can check the model (even before the run).
julia> RD
[Relations]
mratings: users--movies, #known = 233, #test = 20, α = 1.50
bratings: users--books, #known = 314, #test = 0, α = 2.00
[Entities]
users: 50 with no features
movies: 25 with 8 features (λ = sample)
books: 20 with 12 features (λ = sample)
Saving latent vectors
To save the sampled latent variables to disk macau has output
parameter. If it is set to non-empty string macau
saves all posterior latent variable matrices and uses the output
value as a prefix for the file names. The prefix can also include the path, for example
result = macau(RD, output = "/home/user/mylatent")
which will save the files as /home/user/mylatent-...
.
Every sampled latent matrix of every entity will be saved as a separate file, stored by default in CSV format. There is also an option to save the output in binary 32bit floats (which will reduce the space required by 2x). This can be enabled by using output_type="binary"
.
To save the files using prefix mylatent
into the current working directory use output = "mylatent"
.
Reading latent matrix files
The saved CSV files can be read by the standard readdlm
function
U1 = readdlm("mylatent-entity1-01.csv", ',')
where "01" is the sample number that takes values from "01" (or "001") to psamples
.
If the output_type="binary"
was used, the written files can be read by using the function read_binary_float32
, for example
using BayesianDataFusion
U1 = read_binary_float32("mylatent-entity1-01.binary")
Saving link matrix (beta)
For the models that have features the sampled link matrices beta
can be saved by setting both output
to a prefix and output_beta = true
. For example:
result = macau(RD, output = "/home/user/mylatent", output_beta = true)
The beta matrices can be read similarly to the latent matrix files as described above, using either readdlm
for CSV files and read_binary_float32
for binary files.
Efficient storage of sparse matrices
The package also includes functions for writing and reading binary format for sparse matrix.
## random sparse matrix
X = sprand(100, 50, 0.1)
write_sparse_float32("X.sparse", X)
## reading back the rows, cols and values from the file
I, J, V = read_sparse_float32("X.sparse")
X2 = sparse(I, J, V)