Database

Administration with PgAdmin

PgAdmin can be accessed through http(s)://<node domain name>/pgadmin with username admin and password admin (by default). To access the test database just create a server with the name MedCo, the address postgresql, username postgres and password postgres (by default). Note that PgAdmin is not included in the production deployments.
PgAdmin server configuration.

Managing Large Databases: Data Loading

Database modifications

1
-- structure
2
ALTER TABLE i2b2demodata_i2b2.observation_fact
3
ALTER COLUMN instance_num TYPE bigint,
4
ALTER COLUMN text_search TYPE bigint;
5
6
-- settings
7
ALTER SYSTEM SET maintenance_work_mem TO '32GB';
8
SELECT pg_reload_conf();
Copied!

Data generation

All the following operations are implemented through PL/pgSQL functions present in the MedCo-i2b2 database.

Duplication

The parameter of the following functions corresponds to the number of times the existing observation_fact table should be added to itself. For example with 3 , the number of rows of the table will be multiplied by 4.
  • Method 1 (double for loop)
1
SELECT i2b2demodata_i2b2.obs_fact_duplication_method_1(3);
Copied!
  • Method 2 (temporary table)
1
SELECT i2b2demodata_i2b2.obs_fact_duplication_method_2(3);
Copied!

Reduction

The parameter of the following function corresponds to the number of rows the resulting observation_fact table will have.
1
SELECT i2b2demodata_i2b2.obs_fact_reduction(2370000000);
Copied!

Indexes

The following command builds only the i2b2 indexes needed by MedCo. I2b2 offers by default more of them to enable features not currently supported by MedCo.
1
SELECT i2b2demodata_i2b2.obs_fact_indexes();
Copied!

MedCo setup

PostgreSQL database files

In the docker-compose.common.yml of the running profile, add in the postgresql service the following key, pointing to where the database files are:
1
volumes:
2
- PATH_TO_DATABASE_FILES/_data:/var/lib/postgresql/data
Copied!

Timeouts

Depending on the size of the database, you might need to increase several timeouts, notably the following:
  • I2B2_WAIT_TIME_SECONDS
  • SERVER_HTTP_WRITE_TIMEOUT_SECONDS
  • CLIENT_QUERY_TIMEOUT_SECONDS
  • ALL_TIMEOUTS_SECONDS

Backed Up Data

Backup Server

  • SSH: icsil1-ds-49.epfl.ch ; accounts: root / lca1
  • SSH Key of Mickaël and Joao set up (ask one of them for access)
  • 20TB storage in /srv/, publicly accessible read-only through rsync daemon
  • 10Gbps link with IC Clusters through interface 10.90.48.141

Data Stored in /srv/

/srv/databases

  • pg_volume_nodeX_x1212_indexes Databases of 3 nodes (X=0,1,2), with approximately 150k patients and 28.5B records total (50k patients and 9.5B records per node). It is a x1212 duplication of the original tcga_bio dataset. It also contains the built indexes. The size of each is around 3TB, which maybe can be reduced by running a PostgreSQL FULL VACUUM.
  • pg_volume_node0_XB_indexes Those databases are reductions of the pg_volume_node0_x1212_indexes database, with X=4.75,3.17,2.37 billion records. Those numbers were calculated to keep a total number of 28.5B rows with respectively 6, 9 and 12 nodes.

/srv/deployments

  • icclusters-deployment-backup-11-07-19 Contains all the deployment profiles and keys used for the pg_volume_nodeX_x1212_indexes databases.
  • postgresql-deployment Local deployment of postgresql and pgAdmin, in order to explore or modify the databases.
  • nebulaexp-alldeployments Contains all the deployment profiles and keys used for the pg_volume_node0_XB_indexes databases.

/srv/logs

  • duplications-nodeX Logs of data duplication (x100, x2, x3) and indexes building for the pg_volume_nodeX_x1212_indexes databases.
  • reductions-node0 Logs of data reduction (to 2.37B records) and indexes building of pg_volume_node0_x1212_indexes database.

Copying data to/from IC-Cluster machines

Enabling rsync daemon on a linux machine

Using the rsync daemon allows for easier data copy between machines. It is already enabled on the backup server (read-only), but in some cases it can be useful on other machines.
In the file /etc/default/rsync set the variable: RSYNC_ENABLE=true. Create the file /etc/rsyncd.conf with the following content, adapted to your case:
1
uid = root
2
gid = root
3
4
[disk]
5
path = /disk/
6
comment = MedCo data (read-only)
7
read only = yes
Copied!
Then start the daemon with service rsync start.

Copy data with rsync

Example from an IC Cluster machine: rsync -a -v rsync://10.90.48.141/srv/databases/pg_volume_node0_x1212_indexes /disk/pg_volume
Last modified 1yr ago