Harvesting scripts and Docker issues
There are several problems with the harvesting scripts, their instructions, and the Docker integration. Here are the ones I ran into when trying to harvest the data.
README instructions incorrect
The README says:
The -app parameter will trigger a harvest of the resources stored in the Git LFS subdirectories data/rare and data/faidare filtered or not (wheatis and brc4env rely on faidare and rare data respectively).
But that is not actually the case: the -data option still has to be passed when using the -app option.
The README also shows example docker commands for indexing RARe data, but they omit the -data option, which is required.
Dockerfile reproducibility
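The Dockerfile uses an untagged base image (alpine), which resolves to alpine:latest. As a result, building the same Dockerfile at two different times does not necessarily produce the same image.
This is a real problem because, with the current alpine base image, the shell scripts do not run correctly. In particular, they call find -ls, and the -ls option is not supported by the find available inside the container (presumably because Alpine provides the BusyBox implementation of find rather than GNU findutils).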
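A possible workaround for the -ls problem, as a minimal sketch only: replace find -ls with find -exec ls -ld {} +, which uses only POSIX find features. The function name list_files and the sample file name below are made up for illustration, I have not checked this against the actual scripts, and it assumes the find in the container supports the -exec ... {} + form. Note that the listing format differs slightly from find -ls (no inode or block-count columns), so anything parsing that output would need adjusting.

```shell
# list_files DIR: print a long listing of every regular file under DIR
# without relying on the non-POSIX `-ls` primary of find.
# (list_files is a hypothetical helper, not a function from the repository.)
list_files() {
    find "$1" -type f -exec ls -ld {} +
}

# Demo on a throwaway directory with a made-up file name.
demo_dir=$(mktemp -d)
touch "$demo_dir/sample.json.gz"
list_files "$demo_dir"
```

Pinning the base image to an explicit tag in the Dockerfile would address the reproducibility side independently of this workaround.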
Moreover, even after removing the -ls option from the shell scripts, rebuilding the image, and running it, the script still fails with the following error:
Index data and suggestions...
Using timestamp corresponding to date: Fri Dec 20 10:47:26 UTC 2024
Indexing files from /opt/data/faidare/data into index located on elasticsearch:9200/faidare_search_dev-tmstp1734691646-resource-index with 4 parallel threads...
0% 0:8=0s /opt/data/faidare/data/datadiscovery-1.json.gz elasticsearch
parallel: This job failed:
index_resources /opt/data/faidare/data/INRAE-URGI_Alvis_OMTD_1.json.gz data-INRAE-URGI_Alvis_OMTD_1.json elasticsearch
real 0m0.223s
user 0m0.179s
sys 0m0.106s
A problem occured (code=2) when trying to index data
from /opt/data/faidare on faidare application and on dev environment
ERROR related to Elasticsearch API usage found when indexing /tmp/bulk/faidare-dev/data-INRAE-URGI_Alvis_OMTD_1.json-resources.log.gz
ERROR related to Elasticsearch API usage found when indexing /tmp/bulk/faidare-dev/data-INRAE-URGI_unfiltered_AoEwM2EyMTllZjU2ZWY4ZTM2YQ.json-resources.log.gz
ERROR related to Elasticsearch API usage found when indexing /tmp/bulk/faidare-dev/data-IPK_unfiltered_3.json-resources.log.gz
ERROR related to Elasticsearch API usage found when indexing /tmp/bulk/faidare-dev/data-datadiscovery-1.json-resources.log.gz
Error when indexing data, see errors above. Exiting.