Difference between revisions of "Wikidata import 2024-10-28 Virtuoso"

From BITPlan cr Wiki

Revision as of 10:58, 7 November 2024

Storage Overview

./hd/gamma/virtuoso
├── docker-compose.yml
├── scripts
│   └── 10-bulkload.sql
└── wikidata
    └── data
        ├── latest-all.nt.bz2
        └── latest-all.nt.graph
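
The `latest-all.nt.graph` file matters because the bulk-load script passes `NULL` as the graph argument to `ld_dir_all`. As far as the Virtuoso bulk loader documentation describes it, the loader then reads the target graph IRI from a `.graph` file next to the data file. This page does not show the file's content, so the IRI in the sketch below is an assumption, and `write_graph_file` is a hypothetical helper name.

```shell
# Hypothetical helper: write the graph IRI file that accompanies a dump file.
# Usage: write_graph_file <data-dir> <dump-basename> <graph-iri>
write_graph_file() {
  mkdir -p "$1"
  # The .graph file holds nothing but the IRI of the target graph.
  printf '%s\n' "$3" > "$1/$2.graph"
}

# Example (the graph IRI is an assumption, not taken from this page):
# write_graph_file wikidata/data latest-all.nt http://www.wikidata.org
```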

Providing the Wikidata Dump

Downloading the Dump

$ wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2

Splitting the Dump

Note: This was only a first approach. The split took too long, so in the end the complete file was used for the import. Ignore this section for this import.


To load the dump in parallel into Virtuoso we need to split the dump into several files.

$ bzip2 -dk latest-all.nt.bz2
$ split -l 5 -d --additional-suffix ".nt" ./latest-all.nt wd-dump-chunk-
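
With `-l 5` each chunk holds only five lines, so a dump of this size yields an enormous number of tiny files, which may be part of why the step took so long. A possible alternative, sketched below and not what was used for this import, is to stream the compressed dump into GNU `split` with a much larger chunk size. The helper name `split_nt` and the chunk size in the example are assumptions.

```shell
# Hypothetical helper: split N-Triples read from stdin into numbered chunk files.
# Usage: <producer> | split_nt <lines-per-chunk> <output-prefix>
split_nt() {
  # -l: lines per chunk (N-Triples is line-based, so no triple is cut in half)
  # -d -a 4: numeric suffixes of width 4 (0000, 0001, ...)
  # "-": read from stdin, so the dump is never decompressed to disk as a whole
  split -l "$1" -d -a 4 --additional-suffix=.nt - "$2"
}

# Example with an assumed chunk size of 100 million lines:
# bzcat latest-all.nt.bz2 | split_nt 100000000 wd-dump-chunk-
```

The chunks would still have to be placed in the `data` directory, with a matching graph definition, before the bulk-load script can pick them up.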

Virtuoso Docker

DBA_PASSWORD must be set before the Docker container is started for the first time, because the image applies it only when it creates a new database.

version: "1.0"
services:
  virtuoso_db:
    container_name: wd_virtuoso
    image: openlink/virtuoso-opensource-7
    volumes:
      - ./data:/database/data
      - ./scripts:/opt/virtuoso-opensource/initdb.d
    environment:
      - DBA_PASSWORD=dba
      - VIRT_PARAMETERS_NUMBEROFBUFFERS=5450000
      - VIRT_PARAMETERS_MAXDIRTYBUFFERS=4000000
    ports:
      - "1111:1111"
      - "9700:8890"


NumberOfBuffers and MaxDirtyBuffers are set to the values recommended for a system with 64 GB of RAM (see https://vos.openlinksw.com/owiki/wiki/VOS/VirtRDFPerformanceTuning)
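
As a rough plausibility check, the buffer count can be converted into memory. The sketch assumes about 8700 bytes per buffer (one 8 KB page plus bookkeeping overhead, the approximate figure given in the Virtuoso tuning documentation); `buffers_to_gib` is a hypothetical helper name.

```shell
# Rough estimate of buffer-pool memory in GiB, assuming ~8700 bytes per buffer
# (one 8 KB page plus bookkeeping overhead; an approximation).
buffers_to_gib() {
  echo $(( $1 * 8700 / 1024 / 1024 / 1024 ))
}

buffers_to_gib 5450000   # NumberOfBuffers from docker-compose.yml, prints 44
```

About 44 GiB is roughly two thirds of 64 GB, which leaves the remainder for the operating system and other processes.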

Bulk load script: 10-bulkload.sql

--
--  Copyright (C) 2022 OpenLink Software
--

--
--  Add all files that end in .nt
--
ld_dir_all ('data', '*.nt', NULL)
;

--
--  Add all files that end in .bz2, .gz, or .xz, to show that the Virtuoso bulk loader 
--  can load compressed files without manual decompression
--
ld_dir_all ('data', '*.bz2', NULL)
;

ld_dir_all ('data', '*.gz', NULL)
;

ld_dir_all ('data', '*.xz', NULL)
;

--
--  Now load all of the files found above into the database
--
rdf_loader_run()
;

--
--  End of script
--
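
While `rdf_loader_run()` is working, or after it has finished, the state of each registered file can be read from the bulk loader's `DB.DBA.load_list` table. The sketch below assumes the container name, port and user from the compose file, and that `isql` is on the PATH inside the container; `load_status` is a hypothetical helper name.

```shell
# Query listing each registered file with its load state and any error.
# ll_state: 0 = waiting, 1 = loading, 2 = finished
LOAD_STATUS_SQL='SELECT ll_file, ll_state, ll_error FROM DB.DBA.load_list;'

# Hypothetical helper: run the query through isql inside the container.
# Assumes container wd_virtuoso, SQL port 1111 and user dba, as in docker-compose.yml.
load_status() {
  docker exec -i wd_virtuoso isql 1111 dba "${DBA_PASSWORD:-dba}" exec="$LOAD_STATUS_SQL"
}

# Usage: load_status
```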


Starting Docker Container and Bulk-load

Starting a new screen session

$ screen -S wd_virtuoso

Starting the Docker container

$ docker compose up


The import started at 2024-10-29T10:50:00 and the load completed at 2024-11-01T12:13:43, a runtime of roughly 73 hours.
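
The runtime follows directly from the two timestamps. A minimal sketch, assuming GNU `date` is available; `elapsed` is a hypothetical helper name.

```shell
# Elapsed time between two ISO 8601 timestamps (requires GNU date).
# Both timestamps are read as UTC so the result does not depend on the local time zone.
elapsed() {
  start=$(date -u -d "$1" +%s)
  end=$(date -u -d "$2" +%s)
  secs=$(( end - start ))
  printf '%dh %dm %ds\n' $(( secs / 3600 )) $(( secs % 3600 / 60 )) $(( secs % 60 ))
}

elapsed 2024-10-29T10:50:00 2024-11-01T12:13:43   # prints 73h 23m 43s
```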


Database Size after import

/hd/gamma/virtuoso/wikidata$ ls -lsh
total 907G
4.0K drwxrwxr-x 2 th   th   4.0K Oct 29 09:28 data
 12M -rw-r--r-- 1 root root  12M Nov  3 11:15 virtuoso-temp.db
907G -rw-r--r-- 1 root root 907G Nov  3 11:08 virtuoso.db
8.0K -rw-r----- 1 root root 7.1K Oct 29 10:48 virtuoso.ini
4.0K -rw-r--r-- 1 root root   11 Nov  3 11:08 virtuoso.lck
520K -rw-r--r-- 1 root root 515K Nov  3 11:08 virtuoso.log
   0 -rw-r--r-- 1 root root    0 Oct 29 10:48 virtuoso.pxa
 24K -rw-r--r-- 1 root root  17K Nov  3 11:18 virtuoso.trx