Wikidata import 2024-10-28 Virtuoso
Revision as of 10:58, 7 November 2024
Storage Overview
./hd/gamma/virtuoso
├── docker-compose.yml
├── scripts
│   └── 10-bulkload.sql
└── wikidata
    └── data
        ├── latest-all.nt.bz2
        └── latest-all.nt.graph
Providing the Wikidata Dump
Downloading the Dump
$ wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2
Splitting the Dump
Note: This was only a first approach. The split command took too long, so in the end the complete file was used for the import. For this import, ignore this section.
To load the dump in parallel into Virtuoso we need to split the dump into several files.
$ bzip2 -dk latest-all.nt.bz2
$ split -l 5 -d --additional-suffix ".nt" ./latest-all.nt wd-dump-chunk-
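To see how these split options behave without touching the real dump, the following sketch runs the same command shape on a generated 10-line sample file. The sample file, the temporary directory, and the -a 4 suffix width are assumptions for illustration only and were not part of the original import. The value of -l sets how many lines (triples) go into each chunk, so a real dump would call for a far larger number than 5.

```shell
# Illustration only: a 10-line sample stands in for latest-all.nt.
workdir=$(mktemp -d)
cd "$workdir"
for i in $(seq 1 10); do
  echo "<http://example.org/s$i> <http://example.org/p> \"$i\" ."
done > sample.nt

# Same options as above, plus -a 4 for a fixed four-digit numeric suffix.
# -l 5 puts five lines (triples) into each chunk. Requires GNU split.
split -l 5 -d -a 4 --additional-suffix ".nt" ./sample.nt wd-dump-chunk-

# Count and list the resulting chunk files.
n_chunks=$(ls wd-dump-chunk-*.nt | wc -l)
echo "chunks: $n_chunks"
ls wd-dump-chunk-*.nt
```

With ten input lines and -l 5 this yields two chunk files, wd-dump-chunk-0000.nt and wd-dump-chunk-0001.nt.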
Virtuoso Docker
The password DBA_PASSWORD needs to be set before starting the Docker container for the first time.
version: "1.0"
services:
  virtuoso_db:
    container_name: wd_virtuoso
    image: openlink/virtuoso-opensource-7
    volumes:
      - ./data:/database/data
      - ./scripts:/opt/virtuoso-opensource/initdb.d
    environment:
      - DBA_PASSWORD=dba
      - VIRT_PARAMETERS_NUMBEROFBUFFERS=5450000
      - VIRT_PARAMETERS_MAXDIRTYBUFFERS=4000000
    ports:
      - "1111:1111"
      - "9700:8890"
The buffer settings (NumberOfBuffers and MaxDirtyBuffers) were adjusted for 64 GB of memory (see https://vos.openlinksw.com/owiki/wiki/VOS/VirtRDFPerformanceTuning)
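As a rough plausibility check of these numbers, the sketch below estimates the size of the configured buffer pool. It assumes that each buffer caches one 8 KiB database page, which is how the linked tuning guide describes it as far as we can tell; the actual memory footprint per buffer is somewhat higher, so treat the result as a lower bound.

```shell
# Assumption: each Virtuoso buffer caches one 8 KiB database page.
number_of_buffers=5450000
page_bytes=8192

pool_bytes=$(( number_of_buffers * page_bytes ))
pool_gib=$(( pool_bytes / 1024 / 1024 / 1024 ))   # integer division, rounds down

echo "buffer pool: ${pool_bytes} bytes (~${pool_gib} GiB)"
```

The result is about 41 GiB, roughly two thirds of 64 GB, which leaves the rest of the memory for the operating system and other processes.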
Bulk load script: 10-bulkload.sql
--
-- Copyright (C) 2022 OpenLink Software
--
--
-- Add all files that end in .nt
--
ld_dir_all ('data', '*.nt', NULL)
;
--
-- Add all files that end in .bz2, .gz, or .xz, to show that the Virtuoso bulk loader
-- can load compressed files without manual decompression
--
ld_dir_all ('data', '*.bz2', NULL)
;
ld_dir_all ('data', '*.gz', NULL)
;
ld_dir_all ('data', '*.xz', NULL)
;
--
-- Now load all of the files found above into the database
--
rdf_loader_run()
;
--
-- End of script
--
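The script passes NULL as the graph argument, so the target graph has to come from somewhere else. As far as the Virtuoso bulk loader documentation describes it, a file named after the data file with a .graph extension (here latest-all.nt.graph, without the compression suffix) supplies the graph IRI. The sketch below shows how such a file could be created. The graph IRI and the temporary directory are assumptions for illustration; the IRI used in the original import is not recorded here.

```shell
# Hypothetical graph IRI, not taken from the original setup.
graph_iri="http://www.wikidata.org/"

data_dir=$(mktemp -d)                  # stands in for the data directory on the host
touch "$data_dir/latest-all.nt.bz2"    # empty placeholder for the real dump

# The graph file carries the dump's name without the compression suffix
# and contains nothing but the graph IRI.
printf '%s\n' "$graph_iri" > "$data_dir/latest-all.nt.graph"

cat "$data_dir/latest-all.nt.graph"
```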
Starting Docker Container and Bulk-load
Starting a new screen session
$ screen -S wd_virtuoso
Starting the Docker container
$ docker compose up
Started the import at 2024-10-29T10:50:00. Load completed at 2024-11-01T12:13:43.
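The total load time follows from these two timestamps. The sketch below computes it with GNU date, assuming both timestamps are in the same time zone and that no clock change happened in between.

```shell
# Requires GNU date (-d). Timestamps are taken from the note above;
# -u treats both as UTC so daylight-saving rules cannot skew the result.
start=$(date -u -d "2024-10-29 10:50:00" +%s)
end=$(date -u -d "2024-11-01 12:13:43" +%s)

elapsed=$(( end - start ))
printf 'elapsed: %d s = %d h %d min %d s\n' \
  "$elapsed" $(( elapsed / 3600 )) $(( elapsed % 3600 / 60 )) $(( elapsed % 60 ))
```

This gives 264223 seconds, that is 73 h 23 min 43 s, or a little over three days for the complete dump.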
Database Size after import
/hd/gamma/virtuoso/wikidata$ ls -lsh
total 907G
4.0K drwxrwxr-x 2 th th 4.0K Oct 29 09:28 data
12M -rw-r--r-- 1 root root 12M Nov 3 11:15 virtuoso-temp.db
907G -rw-r--r-- 1 root root 907G Nov 3 11:08 virtuoso.db
8.0K -rw-r----- 1 root root 7.1K Oct 29 10:48 virtuoso.ini
4.0K -rw-r--r-- 1 root root 11 Nov 3 11:08 virtuoso.lck
520K -rw-r--r-- 1 root root 515K Nov 3 11:08 virtuoso.log
0 -rw-r--r-- 1 root root 0 Oct 29 10:48 virtuoso.pxa
24K -rw-r--r-- 1 root root 17K Nov 3 11:18 virtuoso.trx