Wikidata import 2024-10-28 Virtuoso

From BITPlan cr Wiki

Revision as of 11:22, 3 November 2024

Storage Overview

./hd/gamma/virtuoso
├── docker-compose.yml
├── scripts
│   └── 10-bulkload.sql
└── wikidata
    └── data
        ├── latest-all.nt.bz2
        └── latest-all.nt.graph

Providing the Wikidata Dump

Downloading the Dump

$ wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2

Splitting the Dump

To load the dump into Virtuoso in parallel, it first has to be decompressed and split into several files.

$ bzip2 -dk latest-all.nt.bz2
$ split -l 5 -d --additional-suffix ".nt" ./latest-all.nt wd-dump-chunk-
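
N-Triples files hold one triple per line, so splitting on line boundaries keeps every chunk valid. `split -l N` writes N lines per output file, so N controls how many chunks are produced. The following is a minimal sketch for deriving N from a desired number of chunks; the helper name `lines_per_chunk` and the chunk count of 16 are assumptions for illustration, not values taken from this import.

```shell
# Hypothetical helper: lines per chunk for a desired number of chunks,
# rounded up so that no more than the requested number of files is created.
lines_per_chunk() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Illustrative numbers only (not measured on the real dump):
lines_per_chunk 1000 8   # prints 125

# Applied to the dump, this could look like (16 chunks is an assumed value):
# split -l "$(lines_per_chunk "$(wc -l < latest-all.nt)" 16)" \
#       -d --additional-suffix ".nt" ./latest-all.nt wd-dump-chunk-
```

Counting the lines with `wc -l` requires a full pass over the decompressed dump, so it may be worth storing the result if the split has to be repeated.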


Virtuoso Docker

The DBA_PASSWORD environment variable must be set before the Docker container is started for the first time.

version: "1.0"
services:
  virtuoso_db:
    container_name: wd_virtuoso
    image: openlink/virtuoso-opensource-7
    volumes:
      - ./data:/database/data
      - ./scripts:/opt/virtuoso-opensource/initdb.d
    environment:
      - DBA_PASSWORD=dba
      - VIRT_PARAMETERS_NUMBEROFBUFFERS=5450000
      - VIRT_PARAMETERS_MAXDIRTYBUFFERS=4000000
    ports:
      - "1111:1111"
      - "9700:8890"


NumberOfBuffers and MaxDirtyBuffers are set to the values recommended for a system with 64 GB of RAM (see https://vos.openlinksw.com/owiki/wiki/VOS/VirtRDFPerformanceTuning)
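
As a rough plausibility check of the buffer setting: each Virtuoso buffer caches one 8 KiB database page, so the configured number of buffers can be converted into an approximate memory footprint. The sketch below assumes exactly 8192 bytes per buffer and ignores per-buffer overhead, so the real usage will be somewhat higher; the helper name `buffers_to_gib` is made up for illustration.

```shell
# Hypothetical helper: approximate buffer pool size in GiB (rounded down),
# assuming 8 KiB per buffer and no bookkeeping overhead.
buffers_to_gib() {
  echo $(( $1 * 8192 / 1024 / 1024 / 1024 ))
}

buffers_to_gib 5450000   # prints 41
```

About 41 GiB is roughly two thirds of 64 GB, which leaves the rest of the memory for the operating system and other processes.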

Bulk load script: 10-bulkload.sql

--
--  Copyright (C) 2022 OpenLink Software
--

--
--  Add all files that end in .nt
--
ld_dir_all ('data', '*.nt', NULL)
;

--
--  Add all files that end in .bz2, .gz, or .xz, to show that the Virtuoso bulk loader 
--  can load compressed files without manual decompression
--
ld_dir_all ('data', '*.bz2', NULL)
;

ld_dir_all ('data', '*.gz', NULL)
;

ld_dir_all ('data', '*.xz', NULL)
;

--
--  Now load all of the files found above into the database
--
rdf_loader_run()
;

--
--  End of script
--


Starting Docker Container and Bulk-load

Starting a new screen session

$ screen -S wd_virtuoso

Starting the Docker container

$ docker compose up
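
While the bulk load is running, its progress can be checked from a second shell by querying the bulk loader's `DB.DBA.load_list` table. The sketch below assumes that `isql` is available inside the container and that the container name, port and credentials match the docker-compose.yml above; the variable `ISQL_CMD` and the function `load_status` are hypothetical names introduced for this example.

```shell
# Assumption: container name, port and credentials as in docker-compose.yml.
# ISQL_CMD can be overridden if isql is reachable in a different way.
ISQL_CMD=${ISQL_CMD:-docker exec -i wd_virtuoso isql 1111 dba dba}

# Hypothetical helper: count the registered files per load state.
# ll_state: 0 = waiting to be loaded, 1 = load in progress, 2 = load finished
load_status() {
  echo "SELECT ll_state, COUNT(*) FROM DB.DBA.load_list GROUP BY ll_state;" | $ISQL_CMD
}
```

Calling `load_status` then prints one row per state; the load is finished once all files are reported in state 2.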


Import started at 2024-10-29T10:50:00. Load completed at 2024-11-01T12:13:43.
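
The two timestamps give the wall-clock duration of the bulk load. A minimal sketch for computing it, assuming GNU `date` (for the `-d` option) and that both timestamps are in the same time zone; the function name `duration_between` is made up for this example.

```shell
# Hypothetical helper: elapsed time between two ISO 8601 timestamps.
# Both are interpreted as UTC so that only their difference matters.
duration_between() {
  start=$(date -u -d "$1" +%s)
  end=$(date -u -d "$2" +%s)
  secs=$((end - start))
  printf '%dd %02dh %02dm %02ds\n' \
    $((secs / 86400)) $((secs % 86400 / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

duration_between "2024-10-29T10:50:00" "2024-11-01T12:13:43"   # prints 3d 01h 23m 43s
```

The import therefore took a little over 73 hours.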


Database Size after import

/hd/gamma/virtuoso/wikidata$ ls -lsh
total 907G
4.0K drwxrwxr-x 2 th   th   4.0K Oct 29 09:28 data
 12M -rw-r--r-- 1 root root  12M Nov  3 11:15 virtuoso-temp.db
907G -rw-r--r-- 1 root root 907G Nov  3 11:08 virtuoso.db
8.0K -rw-r----- 1 root root 7.1K Oct 29 10:48 virtuoso.ini
4.0K -rw-r--r-- 1 root root   11 Nov  3 11:08 virtuoso.lck
520K -rw-r--r-- 1 root root 515K Nov  3 11:08 virtuoso.log
   0 -rw-r--r-- 1 root root    0 Oct 29 10:48 virtuoso.pxa
 24K -rw-r--r-- 1 root root  17K Nov  3 11:18 virtuoso.trx