16 Simulate network data

It is possible to simulate data for a given network using BayesNetty. The network must be a bnlearn network, see section 7, and can be set using a network file which has the same format as a posterior file and sets the network structure and the network parameters. Alternatively, rather than reading in a posterior file, the simulating network can be set by first (in the same parameter file) carrying out any analysis task in BayesNetty that results in a bnlearn network with calculated posteriors.

16.1 Options

The options are as follows:

Option Description Default
-simulate-network-data do a task to simulate network data for a given network
-simulate-network-data-name name label the task with a name Task-n
-simulate-network-data-network-name network simulate data for this network previous network (or the default model given by a node for each data variable and no edges if there is no previous network)
-simulate-network-data-no-sims n simulate n replicates of network data 100
-simulate-network-data-parameter-file parameters.txt network with parameters in bnlearn posteriors file format
-simulate-network-data-whitelist-file whitelist.dat a list of edges that must be included in any network
-simulate-network-data-blacklist-file blacklist.dat a list of edges that must not be included in any network
-simulate-network-data-score score for a bnlearn network choose between loglike, AIC or BIC BIC

If posteriors are not given then default network node parameters are used to simulate data. The default probabilities for a discrete SNP node are 0.25, 0.5 and 0.25 for levels 0, 1 and 2 respectively. A minor allele frequency of 0.5 would give these probabilites. For a discrete node the levels are given by equal probabilities. For a continuous node the intercept is set to 10 and the coefficients and variance are set to 1.

The following sections show some examples of simulating data and can all be found in example.zip

16.2 Example 1: no data

This example simulates node data using a given network structure with network parameters.

The network structure with network parameters must be given to simulate network data. These can be set by using a network parameters file which takes the same format as a posterior file and may be quite complex. To create this network parameter file it is therefore recommended to first output a posterior file for the network that you wish to simulate data for and then to edit this posterior file as required.

To create a network parameter file, example-network-parameters.txt, run the parameter file, paras-make-network-paras.txt:

#input the example network in format 1
-input-network
-input-network-file example-network-sim.dat

#simulate data using default parameters
-simulate-network-data

#calculate the posterior of the network using default parameters
-calc-posterior

#output the posteriors to be used as a network parameters file
-output-posteriors
-output-posteriors-file example-network-parameters.txt

The example network file, example-network-sim.dat, is in network format 1, see section 6.5, except that extra (optional) node information is given as there is no node data to determine the node type. After each node either |dis.snp, |cts.snp, |dis|2 or |cts may be written to give a discrete SNP node, a continuous SNP node, a discrete node with the number of levels or a continuous node respectively. If the type of node is not set then the node is set to a continuous node. If a discrete node is specified without the number of levels then the number of levels is set to 0. The example network file is as follows.

rs1|dis.snp
rs2|dis.snp
mood|dis|2
express|cts
pheno|cts
rs1 pheno
rs2 pheno
mood express
pheno express

The network parameter file, example-network-parameters.txt, can then be created by running the parameter file

./bayesnetty paras-make-network-paras.txt

The output will look something as follows

BayesNetty: Bayesian Network software, v1.00
--------------------------------------------------
Copyright 2015-present Richard Howey, GNU General Public License, v3
Institute of Genetic Medicine, Newcastle University

Random seed: 1551712417
--------------------------------------------------
Task name: Task-1
Loading network
Network file: example-network-sim.dat
Network type: bnlearn
Network score type: BIC
Total number of nodes: 5 (Discrete: 0 | Factor: 0 | Continuous: 0 | No data: 5)
Total number of edges: 4
Network Structure: [rs1][rs2][mood][pheno|rs1:rs2][express|mood:pheno]
The network has nodes with no data
--------------------------------------------------
--------------------------------------------------
Task name: Task-2
Simulating network data
Data simulation network given by network: Task-1
Number of simulations: 100
Network: Task-2
Network type: bnlearn
Network score type: BIC
Total number of nodes: 5 (Discrete: 3 | Factor: 0 | Continuous: 2)
Total number of edges: 4
Network Structure: [rs1][rs2][mood][pheno|rs1:rs2][express|mood:pheno]
Total data at each node: 100
Missing data at each node: 0
--------------------------------------------------
--------------------------------------------------
Task name: Task-3
Calculating network score
Network: Task-2
Network structure: [rs1][rs2][mood][pheno|rs1:rs2][express|mood:pheno]
Network score type: BIC
Network score = -600.611
--------------------------------------------------
--------------------------------------------------
Task name: Task-4
Outputting posteriors
Network: Task-2
Network Structure: [rs1][rs2][mood][pheno|rs1:rs2][express|mood:pheno]
Output posteriors to file: example-network-parameters.txt
--------------------------------------------------

Run time: less than one second

The network parameter file, example-network-parameters.txt, will look something like as follows:

Posteriors:
===========

DISCRETE SNP NODE: rs1
  0: 0.26
  1: 0.49
  2: 0.25

DISCRETE SNP NODE: rs2
  0: 0.25
  1: 0.51
  2: 0.24

DISCRETE NODE: mood
  0: 0.47
  1: 0.53

CONTINUOUS NODE: express
 DISCRETE PARENTS: mood
 0:
  Intercept: 9.71024
  Coefficients: pheno: 1.04254 
  Mean: 20.2202
  Variance: 0.811076
 1:
  Intercept: 9.89
  Coefficients: pheno: 1.00832 
  Mean: 19.9452
  Variance: 0.6773

CONTINUOUS NODE: pheno
 DISCRETE PARENTS: rs1:rs2
 0:0:
  Intercept: 10.4619
  Coefficients: 
  Mean: 10.4619
  Variance: 2.07491
 1:0:
  Intercept: 9.97966
  Coefficients: 
  Mean: 9.97966
  Variance: 1.7134
 2:0:
  Intercept: 9.58534
  Coefficients: 
  Mean: 9.58534
  Variance: 0.532354
 0:1:
  Intercept: 10.0514
  Coefficients: 
  Mean: 10.0514
  Variance: 1.44765
 1:1:
  Intercept: 9.96623
  Coefficients: 
  Mean: 9.96623
  Variance: 0.760351
 2:1:
  Intercept: 9.54098
  Coefficients: 
  Mean: 9.54098
  Variance: 0.559675
 0:2:
  Intercept: 10.7559
  Coefficients: 
  Mean: 10.7559
  Variance: 1.78883
 1:2:
  Intercept: 10.1066
  Coefficients: 
  Mean: 10.1066
  Variance: 0.983235
 2:2:
  Intercept: 10.3842
  Coefficients: 
  Mean: 10.3842
  Variance: 1.50665

The data was simulated using default network node parameters. The simulated node data and subsequent fitted parameters are thus close to these values.

The network parameter file, example-network-parameters.txt, can now be edited using parameters of your choice. The mean is not required, this simply reports the mean of the node data for continuous nodes. The “Posteriors” title in the file is also not required, but these may be left in the file. The levels of discrete nodes are labelled 0, 1, 2 etc. These may be renamed to something more meaningful in this file. For example, for node “mood” the levels could be renamed “sad” and “happy”.

Finally, network node data may be simulated for a network with chosen network parameters where initially there was no data available for the network.

#simulate data
-simulate-network-data
-simulate-network-data-no-sims 200
-simulate-network-data-parameter-file example-network-parameters.txt

#output simulated data
-output-network
-output-network-node-data-file-prefix sim-data
-output-network-node-data-bed-file

The parameter file shown above, paras-sim-data1.txt, can then be used to simulate data and output it to several data files.

./bayesnetty paras-sim-data1.txt

The output will look something as follows

BayesNetty: Bayesian Network software, v1.00
--------------------------------------------------
Copyright 2015-present Richard Howey, GNU General Public License, v3
Institute of Genetic Medicine, Newcastle University

Random seed: 1551712579
--------------------------------------------------
Task name: Task-1
Simulating network data
Number of simulations: 200
Parameter file name: example-network-parameters.txt
Network: Task-1
Network type: bnlearn
Network score type: BIC
Total number of nodes: 5 (Discrete: 3 | Factor: 0 | Continuous: 2)
Total number of edges: 4
Network Structure: [rs1][rs2][mood][pheno|rs1:rs2][express|mood:pheno]
Total data at each node: 200
Missing data at each node: 0
--------------------------------------------------
--------------------------------------------------
Task name: Task-2
Outputting network
Network: Task-1
Network Structure: [rs1][rs2][mood][pheno|rs1:rs2][express|mood:pheno]
Network output to file: network.dat
Node data output to files:
 sim-data-discrete.dat
 sim-data-cts.dat
 sim-data.bed/.bim/.fam
--------------------------------------------------

Run time: less than one second

The file sim-data-discrete.dat contains the discrete node data, sim-data-cts.dat the continuous node data and sim-data.bed, sim-data.bim and sim-data.fam the SNP node data in PLINK binary pedigree format.

16.3 Example 2: data and fitted network

This example inputs some data, sets the network structure, fits network posterior parameters and then simulates network node data using these parameters. The simulated data is then output to file. The parameter file, paras-sim-data2.txt, in the example files does this and is as follows:

#input continuous data
-input-data
-input-data-file example-cts.dat
-input-data-cts

#input discrete data
-input-data
-input-data-file example-discrete.dat
-input-data-discrete

#input SNP data as discrete data
-input-data
-input-data-file example.bed
-input-data-discrete-snp

#input the example network in format 1
-input-network
-input-network-file example-network-format1.dat

#simulate data
-simulate-network-data
-simulate-network-data-no-sims 200

#output simulated data
-output-network
-output-network-node-data-file-prefix sim-data
-output-network-node-data-bed-file
./bayesnetty paras-sim-data2.txt

The output will look something as follows

BayesNetty: Bayesian Network software, v1.00
--------------------------------------------------
Copyright 2015-present Richard Howey, GNU General Public License, v3
Institute of Genetic Medicine, Newcastle University

Random seed: 1551716397
--------------------------------------------------
Task name: Task-1
Loading data
Continuous data file: example-cts.dat
Number of ID columns: 2
Including (all) 2 variables in analysis
Each variable has 1500 data entries
Missing value: not set
--------------------------------------------------
--------------------------------------------------
Task name: Task-2
Loading data
Discrete data file: example-discrete.dat
Number of ID columns: 2
Including the 1 and only variable in analysis
Each variable has 1500 data entries
Missing value: NA
--------------------------------------------------
--------------------------------------------------
Task name: Task-3
Loading data
SNP binary data file: example.bed
SNP data treated as discrete data
Total number of SNPs: 2
Total number of subjects: 1500
Number of ID columns: 2
Including (all) 2 variables in analysis
Each variable has 1500 data entries
--------------------------------------------------
--------------------------------------------------
Task name: Task-4
Loading network
Network file: example-network-format1.dat
Network type: bnlearn
Network score type: BIC
Total number of nodes: 5 (Discrete: 3 | Factor: 0 | Continuous: 2)
Total number of edges: 4
Network Structure: [mood][rs1][rs2][pheno|rs1:rs2][express|pheno:mood]
Total data at each node: 1495
Missing data at each node: 5
--------------------------------------------------
--------------------------------------------------
Task name: Task-5
Simulating network data
Data simulation network given by network: Task-4
Number of simulations: 200
Network: Task-5
Network type: bnlearn
Network score type: BIC
Total number of nodes: 5 (Discrete: 3 | Factor: 0 | Continuous: 2)
Total number of edges: 4
Network Structure: [mood][rs1][rs2][pheno|rs1:rs2][express|pheno:mood]
Total data at each node: 200
Missing data at each node: 0
--------------------------------------------------
--------------------------------------------------
Task name: Task-6
Outputting network
Network: Task-5
Network Structure: [mood][rs1][rs2][pheno|rs1:rs2][express|pheno:mood]
Network output to file: network.dat
Node data output to files:
 sim-data-discrete.dat
 sim-data-cts.dat
 sim-data.bed/.bim/.fam
--------------------------------------------------

Run time: less than one second

As in the previous example the simulated node data is output to a number of files.

16.4 Example 3: data and unknown network

In this example the network that the data will be simulated for is not known initially. To do this, some data is input, the best fitting network is chosen using a network search and then node data is simulated using the fitted parameters. The parameter file, paras-sim-data3.txt, in the example files does this and is as follows:

#input continuous data
-input-data
-input-data-file example-cts.dat
-input-data-cts

#input discrete data
-input-data
-input-data-file example-discrete.dat
-input-data-discrete

#input SNP data as discrete data
-input-data
-input-data-file example.bed
-input-data-discrete-snp

#search for the best fitting model
-search-models

#simulate data
-simulate-network-data
-simulate-network-data-no-sims 200

#output simulated data
-output-network
-output-network-node-data-file-prefix sim-data
-output-network-node-data-bed-file
./bayesnetty paras-sim-data3.txt

The output will look something as follows

BayesNetty: Bayesian Network software, v1.00
--------------------------------------------------
Copyright 2015-present Richard Howey, GNU General Public License, v3
Institute of Genetic Medicine, Newcastle University

Random seed: 1551716718
--------------------------------------------------
Task name: Task-1
Loading data
Continuous data file: example-cts.dat
Number of ID columns: 2
Including (all) 2 variables in analysis
Each variable has 1500 data entries
Missing value: not set
--------------------------------------------------
--------------------------------------------------
Task name: Task-2
Loading data
Discrete data file: example-discrete.dat
Number of ID columns: 2
Including the 1 and only variable in analysis
Each variable has 1500 data entries
Missing value: NA
--------------------------------------------------
--------------------------------------------------
Task name: Task-3
Loading data
SNP binary data file: example.bed
SNP data treated as discrete data
Total number of SNPs: 2
Total number of subjects: 1500
Number of ID columns: 2
Including (all) 2 variables in analysis
Each variable has 1500 data entries
--------------------------------------------------
--------------------------------------------------
Task name: Task-4
Searching network models
--------------------------------------------------
Loading defaultNetwork network
Network type: bnlearn
Network score type: BIC
Total number of nodes: 5 (Discrete: 3 | Factor: 0 | Continuous: 2)
Total number of edges: 0
Network Structure: [express][pheno][mood][rs1][rs2]
Total data at each node: 1495
Missing data at each node: 5
--------------------------------------------------
Network: defaultNetwork
Search: Greedy
Random restarts: 0
Random jitter restarts: 0
Network Structure: [mood][rs1][rs2][express|rs1:rs2][pheno|express:mood]
Network score type: BIC
Network score = -8213.45
--------------------------------------------------
--------------------------------------------------
Task name: Task-5
Simulating network data
Data simulation network given by network: defaultNetwork
Number of simulations: 200
Network: Task-5
Network type: bnlearn
Network score type: BIC
Total number of nodes: 5 (Discrete: 3 | Factor: 0 | Continuous: 2)
Total number of edges: 4
Network Structure: [mood][rs1][rs2][express|rs1:rs2][pheno|express:mood]
Total data at each node: 200
Missing data at each node: 0
--------------------------------------------------
--------------------------------------------------
Task name: Task-6
Outputting network
Network: Task-5
Network Structure: [mood][rs1][rs2][express|rs1:rs2][pheno|express:mood]
Network output to file: network.dat
Node data output to files:
 sim-data-discrete.dat
 sim-data-cts.dat
 sim-data.bed/.bim/.fam
--------------------------------------------------

Run time: less than one second

As in the previous example the simulated node data is output to a number of files.