Using DataSHIELD
The purpose of this guide is to give practical advice for running analyses with DataSHIELD. It does not cover basics such as connecting to servers or logging in, but aims to tackle some of the real-world issues encountered when running analyses.
Virtual pooling vs. SLMA
Whether to use study-level meta-analysis (SLMA) or virtual pooling is a decision to be made during analysis planning, as it influences how much harmonisation is required. The trade-offs are summarised below.
| SLMA | Virtual pooling |
| --- | --- |
| No need to have all variables in all cohorts, nor to harmonise confounders (although it might be better if they are harmonised) | All variables must be harmonised, so more harmonisation work |
| Either exclude cohorts that are missing a variable from a particular analysis (required if the missing variable is the exposure or outcome), or include them and manage the missing variable (a confounder) by removing it from models | Cohorts missing a variable must be excluded from that particular analysis |
| The model is run per cohort; small cohorts can fail to fit the model, which must be managed in code | Not a concern, as all cohorts are virtually pooled |
| Unobserved heterogeneity is accounted for with a random-effects meta-analysis | A cohort variable can allow different intercepts, but this is not as comprehensive as a random-effects meta-analysis |
| Survival analysis is complex to set up with the piecewise Poisson method, as buckets are needed per cohort (though Cox models could easily be implemented and their results pooled) | Survival analysis is less complex: only one set of buckets is needed for the whole study. But Cox models are not yet possible |
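For reference, the SLMA route maps onto ds.glmSLMA() in dsBaseClient, which fits the model separately in each cohort and, with combine.with.metafor = TRUE, pools the coefficients by random-effects meta-analysis. A minimal sketch, assuming a login object called connections, a harmonised server-side data frame D, and hypothetical variable names:

```r
library(dsBaseClient)

# Fit per cohort, then pool coefficients by random-effects meta-analysis
slma <- ds.glmSLMA(formula = "outcome ~ exposure + age + sex",  # hypothetical variables
                   family = "binomial",
                   dataName = "D",
                   combine.with.metafor = TRUE,
                   datasources = connections)
```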
Data Quality
This refers to checks that should be done after harmonisation is complete; these are probably covered by the LifeCycle project or more recent Maelstrom work.
However, some checks are still needed in DataSHIELD before starting work. The main one is to make sure that every cohort stores each variable with the same data type: call ds.class() on each variable and check that the result is identical in all cohorts. Mismatches (e.g. a factor in one cohort and a numeric in another) cause many problems later on. A minimal sketch of this check follows.
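This sketch assumes a login object called connections and a harmonised data frame D on every server (both names are assumptions):

```r
library(dsBaseClient)

# Variable names, taken from the first cohort's copy of D
vars <- ds.colnames(x = "D", datasources = connections)[[1]]

for (v in vars) {
  # Class of this variable on each server, collapsed to one string per cohort
  cls <- sapply(ds.class(x = paste0("D$", v), datasources = connections),
                paste, collapse = "/")
  if (length(unique(cls)) > 1) {
    message(v, " differs across cohorts: ",
            paste(names(cls), cls, sep = "=", collapse = "; "))
  }
}
```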
Data preparation
This depends somewhat on how harmonisation and upload have been done. For InterConnect, we carried out these steps in DataSHIELD, because it was then easy to generate participant numbers after each step to show how the population studied was obtained, and easier to change the thresholds. For example, we might eliminate participants who (sketched in code after the list):
- Already have diabetes
- Have high energy intake (N.B. different thresholds for men and women)
- Have low energy intake (N.B. different thresholds for men and women)
- Don’t have complete data
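A minimal sketch of such sequential exclusions using ds.dataFrameSubset(), assuming a login object connections, a server-side data frame D, and hypothetical variable names and thresholds; ds.dim() after each step gives the numbers for a participant flow diagram:

```r
library(dsBaseClient)

# 1. Drop participants with prevalent diabetes (hypothetical variable)
ds.dataFrameSubset(df.name = "D",
                   V1.name = "D$diabetes_prevalent", V2.name = "0",
                   Boolean.operator = "==",
                   newobj = "D1", datasources = connections)

# 2. Drop implausibly high energy intakes (single threshold for brevity;
#    in practice apply sex-specific thresholds)
ds.dataFrameSubset(df.name = "D1",
                   V1.name = "D1$energy_kcal", V2.name = "3500",
                   Boolean.operator = "<=",
                   newobj = "D2", datasources = connections)

# Numbers remaining after each step
ds.dim("D1", datasources = connections)
ds.dim("D2", datasources = connections)
```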
The final point is tricky when not all cohorts have all the variables to be used in the analysis. Variables known to be missing for all participants in a cohort should not be used as part of the elimination, or the whole cohort will be removed. This needs to be managed in the code. One way of doing it is to keep a master table recording which variables each cohort is missing. A function is then needed that returns, for each cohort, the list of variables that can safely be used in the elimination; the known-missing variables are removed from the data frame before the complete-case step and joined back on afterwards. A sketch of the master table and helper follows.
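This is a minimal client-side sketch; all cohort and variable names are hypothetical:

```r
# Master table: TRUE means the variable is available in that cohort
availability <- data.frame(cohort  = c("cohortA", "cohortB", "cohortC"),
                           bmi     = c(TRUE, TRUE, FALSE),
                           smoking = c(TRUE, FALSE, TRUE))

# Variables from `vars` that can safely be used in the complete-case
# elimination for a given cohort
safe_vars <- function(cohort, vars, availability) {
  row <- availability[availability$cohort == cohort, , drop = FALSE]
  vars[vapply(vars, function(v) isTRUE(row[[v]]), logical(1))]
}

safe_vars("cohortB", c("bmi", "smoking"), availability)  # returns "bmi"
```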
Special extras for survival analysis
- Set up the buckets, e.g. quintiles of follow-up time (one set per cohort for SLMA, one set for the whole study when pooling).
- In a nested case-cohort design, a set of weights needs to be generated and included in the piecewise Poisson regression.
- Use ds.lexis() to expand the data into the time buckets (see the sketch below).
- Developing a Cox SLMA would help with all of this.
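A minimal sketch of the expansion step with ds.lexis(), assuming a login object connections and a server-side data frame D with hypothetical columns id, time0, time1 and event. The bucket widths shown are illustrative, and the columns created in the expanded data frame can vary between dsBase versions, so inspect them before fitting the piecewise Poisson model:

```r
library(dsBaseClient)

ds.lexis(data = "D",
         intervalWidth = c(2, 2, 2, 5, 5),  # illustrative bucket widths (years)
         idCol = "D$id",
         entryCol = "D$time0",
         exitCol = "D$time1",
         statusCol = "D$event",
         expandDF = "D_expanded",
         datasources = connections)

# Check what the expanded data frame contains
ds.colnames("D_expanded", datasources = connections)
```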
Save workspaces
Saving the server-side workspaces after data preparation avoids repeating the preparation steps every session; a sketch follows.
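This uses the DSI workspace functions, assuming a login object connections; the workspace name is arbitrary:

```r
library(DSI)

# Save the prepared server-side objects under a named workspace
datashield.workspace_save(conns = connections, ws = "analysis-prepared")

# In a later session, restore it at login time
# connections <- datashield.login(logins = logindata, restore = "analysis-prepared")
```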
Running analyses
It is likely that a set of models of increasing complexity will be specified, with each model being run for different combinations of exposure and outcome. It is useful to run each model in a loop and store each model/variable combination's results in a file, writing the variables used into the filename. The results generated from the loops can be rough and ready; the saved files can be read in later to generate publication figures. A sketch of such a loop follows.
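This sketch assumes the connections object from earlier and hypothetical exposure, outcome and confounder names; each result is saved to a file named after the combination:

```r
library(dsBaseClient)

exposures <- c("fish_intake", "meat_intake")  # hypothetical
outcomes  <- c("incident_t2d")                # hypothetical
adjust    <- "age + sex"                      # model 1 confounders

for (out in outcomes) {
  for (ex in exposures) {
    fmla <- paste(out, "~", ex, "+", adjust)
    fit  <- ds.glmSLMA(formula = fmla, family = "binomial",
                       dataName = "D", datasources = connections)
    # Rough-and-ready storage; read back later for publication figures
    saveRDS(fit, file = paste0("results_", out, "_", ex, "_model1.rds"))
  }
}
```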
Before fitting the model for a particular exposure-outcome combination, it is necessary to generate the list of cohorts that have the required variables. This can be done with a function that checks against the master table. The check must cover exposures and outcomes, and also confounders in the pooled scenario. In the SLMA scenario, confounders can be left out for some cohorts, so a further function is needed to remove the missing variables from that cohort's list of confounders.
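A minimal sketch of the cohort-selection check, building on the hypothetical availability table from the data-preparation section (repeated here so the sketch is self-contained); the per-cohort confounder trimming can reuse safe_vars() from that section:

```r
availability <- data.frame(cohort  = c("cohortA", "cohortB", "cohortC"),
                           bmi     = c(TRUE, TRUE, FALSE),
                           smoking = c(TRUE, FALSE, TRUE))

# Cohorts that have every variable needed for this model
cohorts_with <- function(vars, availability) {
  ok <- apply(availability[, vars, drop = FALSE], 1, all)
  availability$cohort[ok]
}

use <- cohorts_with(c("bmi", "smoking"), availability)  # "cohortA"
# Restrict the fit accordingly, e.g. datasources = connections[use]
```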
Note that for smaller cohorts in an SLMA, the model will sometimes fail to fit for a particular exposure and outcome. Capture this in the code so that the cohort is excluded for that model/variable combination, as in the sketch below.
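This sketch assumes the fmla and connections objects from the loop above; fitting per cohort with ds.glm() makes failures easy to trap, and failed cohorts are simply dropped for that combination:

```r
library(dsBaseClient)

per_cohort <- lapply(names(connections), function(cohort) {
  tryCatch(
    ds.glm(formula = fmla, data = "D", family = "binomial",
           datasources = connections[cohort]),
    error = function(e) {
      message(cohort, " excluded from this model: ", conditionMessage(e))
      NULL  # mark the cohort as excluded
    }
  )
})
names(per_cohort) <- names(connections)
per_cohort <- Filter(Negate(is.null), per_cohort)  # cohorts that fitted
```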