Database Administration with PgAdmin
PgAdmin can be accessed through http(s)://<node domain name>/pgadmin, with username admin and password admin (by default). To access the test database, simply create a server with the name MedCo, the address postgresql, username postgres and password postgres (by default). Note that PgAdmin is not included in the production deployments.
Managing Large Databases: Data Loading
Database modifications
Data generation
All the following operations are implemented through PL/pgSQL functions present in the MedCo-i2b2 database.
Duplication
The parameter of the following functions corresponds to the number of times the existing observation_fact table should be added to itself. For example, with 3, the number of rows of the table will be multiplied by 4.
Method 1 (double for loop)
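The actual function is defined in the MedCo-i2b2 database; as an illustration only, a sketch of a loop-based duplication is given below (function name, schema and column handling are assumptions, not the actual MedCo code):

```sql
-- Illustrative sketch only, NOT the actual MedCo function.
-- Outer loop: one pass per requested duplication; inner loop: copies the
-- original rows batch by batch (ranges of patient_num) to keep each INSERT
-- reasonably sized. patient_num and encounter_num are offset to avoid
-- primary-key collisions.
CREATE OR REPLACE FUNCTION duplicate_observation_fact_loop(times integer)
RETURNS void AS $$
DECLARE
    batch_size    integer := 10000;
    max_patient   integer;
    max_encounter integer;
BEGIN
    SELECT max(patient_num), max(encounter_num)
      INTO max_patient, max_encounter
      FROM observation_fact;

    FOR i IN 1..times LOOP
        FOR batch_start IN 0..max_patient BY batch_size LOOP
            INSERT INTO observation_fact
                (encounter_num, patient_num, concept_cd, provider_id,
                 start_date, modifier_cd, instance_num)
            SELECT encounter_num + i * max_encounter,
                   patient_num   + i * max_patient,
                   concept_cd, provider_id, start_date, modifier_cd, instance_num
            FROM observation_fact
            WHERE patient_num >  batch_start
              AND patient_num <= batch_start + batch_size
              AND patient_num <= max_patient;  -- copy only the original rows
        END LOOP;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
```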
Method 2 (temporary table)
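Again an illustrative sketch only (assumed names, not the actual MedCo code): the same duplication, but the original rows are first snapshotted into a temporary table so that every pass reads a stable copy instead of the growing fact table.

```sql
-- Illustrative sketch only, NOT the actual MedCo function.
CREATE OR REPLACE FUNCTION duplicate_observation_fact_tmp(times integer)
RETURNS void AS $$
DECLARE
    max_patient   integer;
    max_encounter integer;
BEGIN
    SELECT max(patient_num), max(encounter_num)
      INTO max_patient, max_encounter
      FROM observation_fact;

    -- snapshot of the original rows, dropped at the end of the transaction
    CREATE TEMPORARY TABLE observation_fact_orig ON COMMIT DROP AS
        SELECT * FROM observation_fact;

    FOR i IN 1..times LOOP
        INSERT INTO observation_fact
            (encounter_num, patient_num, concept_cd, provider_id,
             start_date, modifier_cd, instance_num)
        SELECT encounter_num + i * max_encounter,
               patient_num   + i * max_patient,
               concept_cd, provider_id, start_date, modifier_cd, instance_num
        FROM observation_fact_orig;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
```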
Reduction
The parameter of the following function corresponds to the number of rows the resulting observation_fact table will have.
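The function itself is defined in the MedCo-i2b2 database; as an illustration only, a minimal sketch of such a reduction is given below (function name, schema and the way the kept rows are selected are assumptions):

```sql
-- Illustrative sketch only, NOT the actual MedCo function.
-- Keeps target_rows rows of observation_fact and discards the rest.
CREATE OR REPLACE FUNCTION reduce_observation_fact(target_rows bigint)
RETURNS void AS $$
BEGIN
    CREATE TEMPORARY TABLE observation_fact_kept ON COMMIT DROP AS
        SELECT * FROM observation_fact
        LIMIT target_rows;

    TRUNCATE observation_fact;

    INSERT INTO observation_fact
        SELECT * FROM observation_fact_kept;
END;
$$ LANGUAGE plpgsql;
```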
Indexes
The following command builds only the i2b2 indexes needed by MedCo. By default, i2b2 defines more of them, enabling features not currently supported by MedCo.
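The command itself lives in the MedCo-i2b2 database scripts; as an illustration only, it boils down to CREATE INDEX statements on observation_fact along these lines (the index names and columns below are assumptions, not the actual MedCo set):

```sql
-- Illustrative only: names and columns are assumptions, not the actual MedCo set.
CREATE INDEX IF NOT EXISTS of_idx_concept   ON observation_fact (concept_cd);
CREATE INDEX IF NOT EXISTS of_idx_patient   ON observation_fact (patient_num);
CREATE INDEX IF NOT EXISTS of_idx_encounter ON observation_fact (encounter_num);
```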
MedCo setup
PostgreSQL database files
In the docker-compose.common.yml of the running profile, add the following key in the postgresql service, pointing to where the database files are:
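For instance (the host path /disk/pg_volume is a placeholder, and the container path assumes the standard PostgreSQL image data directory):

```yaml
services:
  postgresql:
    volumes:
      # host directory holding the database files -> PostgreSQL data directory
      - /disk/pg_volume:/var/lib/postgresql/data
```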
Timeouts
Depending on the size of the database, you might need to increase several timeouts, notably the following (an example of setting them is sketched after this list):
I2B2_WAIT_TIME_SECONDS
SERVER_HTTP_WRITE_TIMEOUT_SECONDS
CLIENT_QUERY_TIMEOUT_SECONDS
ALL_TIMEOUTS_SECONDS
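These are typically set as environment variables in the deployment profile's configuration (e.g. its .env file; where exactly each one is set depends on the profile, and the values below are placeholders to adapt to the expected query duration):

```
I2B2_WAIT_TIME_SECONDS=900
SERVER_HTTP_WRITE_TIMEOUT_SECONDS=900
CLIENT_QUERY_TIMEOUT_SECONDS=900
ALL_TIMEOUTS_SECONDS=900
```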
Backed Up Data
Backup Server
SSH: icsil1-ds-49.epfl.ch ; accounts: root / lca1
SSH keys of Mickaël and Joao are set up (ask one of them for access)
20TB of storage in /srv/, publicly accessible read-only through the rsync daemon
10Gbps link with IC Clusters through interface 10.90.48.141
Data Stored in /srv/
/srv/databases
pg_volume_nodeX_x1212_indexes
Databases of 3 nodes (X=0,1,2), with approximately 150k patients and 28.5B records in total (50k patients and 9.5B records per node). It is a x1212 duplication of the original tcga_bio dataset and also contains the built indexes. The size of each is around 3TB, which might be reduced by running a PostgreSQL VACUUM FULL.

pg_volume_node0_XB_indexes
These databases are reductions of the pg_volume_node0_x1212_indexes database, with X=4.75, 3.17 or 2.37 billion records. Those numbers were calculated to keep a total of 28.5B rows with respectively 6, 9 and 12 nodes.
/srv/deployments
icclusters-deployment-backup-11-07-19
Contains all the deployment profiles and keys used for the pg_volume_nodeX_x1212_indexes databases.

postgresql-deployment
Local deployment of PostgreSQL and PgAdmin, in order to explore or modify the databases.

nebulaexp-alldeployments
Contains all the deployment profiles and keys used for the pg_volume_node0_XB_indexes databases.
/srv/logs
duplications-nodeX
Logs of the data duplication (x100, x2, x3) and index building for the pg_volume_nodeX_x1212_indexes databases.

reductions-node0
Logs of the data reduction (to 2.37B records) and index building for the pg_volume_node0_x1212_indexes database.
Copying data to/from IC-Cluster machines
Enabling the rsync daemon on a Linux machine
Using the rsync daemon allows for easier data copying between machines. It is already enabled on the backup server (read-only), but in some cases it can be useful to enable it on other machines as well.
In the file /etc/default/rsync, set the variable RSYNC_ENABLE=true. Then create the file /etc/rsyncd.conf with the following content, adapted to your case:
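A minimal example is sketched below (module name and options are illustrative; the [srv] module mirrors the layout used on the backup server, where /srv is exposed read-only):

```
# minimal rsyncd.conf example: exposes /srv read-only under the "srv" module
[srv]
    path = /srv
    comment = MedCo backup data
    read only = yes
    list = yes
```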
Then start the daemon with service rsync start.
Copy data with rsync
Example from an IC Cluster machine: rsync -a -v rsync://10.90.48.141/srv/databases/pg_volume_node0_x1212_indexes /disk/pg_volume