I have the following bash script where I attempt to download the training, validation, and label datasets from https://ai.google.com/research/ConceptualCaptions/download. However, downloading via the script only fetches what looks like metadata (~170 KB per file), whereas I am expecting files of 500 MB+. When I simply click the download links in a browser, I do get the expected full-size downloads. How would I go about getting the right links so that I can curl/wget the files? I tried deleting the part of the URL after .tsv (the ?_ga=... query string), with no luck.
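As a sanity check on what the script is actually saving, this is what I would inspect (a minimal sketch; it assumes the script below has already been run once so GCC-training.tsv exists, and that the ~170 KB file is most likely an HTML page rather than the TSV):
# Look at the first bytes, the detected file type, and the size of the downloaded "TSV".
head -c 300 GCC-training.tsv
echo
file GCC-training.tsv    # typically reports "HTML document" if it is not a real TSV
ls -lh GCC-training.tsv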
#!/bin/bash
# Array of URLs to be downloaded
URLS=(
  "https://storage.cloud.google.com/gcc-data/Train/GCC-training.tsv?_ga=2.191230122.-1896153081.1529438250"
  "https://storage.cloud.google.com/gcc-data/Validation/GCC-1.1.0-Validation.tsv?_ga=2.141047602.-1896153081.1529438250"
  "https://storage.cloud.google.com/conceptual-captions-v1-1-labels/Image_Labels_Subset_Train_GCC-Labels-training.tsv?_ga=2.234395421.-20118413.1607637118"
)
# Array of output file names corresponding to the URLs
OUTPUT_FILES=(
  "GCC-training.tsv"
  "GCC-Validation.tsv"
  "GCC-Labels-training.tsv"
)
# Loop through each URL and download the file
for i in "${!URLS[@]}"; do
  URL=${URLS[$i]}
  OUTPUT_FILE=${OUTPUT_FILES[$i]}
  echo "Downloading $URL to $OUTPUT_FILE..."
  curl -L -o "$OUTPUT_FILE" "$URL"
  # Check if the download was successful
  if [ $? -eq 0 ]; then
    echo "Download of $OUTPUT_FILE completed successfully."
  else
    echo "Download of $OUTPUT_FILE failed."
  fi
done
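One thing worth noting: curl exits with status 0 as long as it receives a response, even if that response is an HTML page or an HTTP error body, so the success check above passes regardless of what was actually downloaded. For reference, here is a stricter version of the download step inside the loop (a sketch only; it uses the same URLs, so it does not solve the link problem itself, but it fails on HTTP errors and warns about suspiciously small files; the 1 MB threshold is just an arbitrary sanity value):
# Drop-in replacement for the curl/check part inside the loop above.
# -f makes curl return a non-zero exit code on HTTP errors instead of saving the error page.
if curl -fL -o "$OUTPUT_FILE" "$URL"; then
  SIZE=$(wc -c < "$OUTPUT_FILE")
  if [ "$SIZE" -lt 1000000 ]; then
    echo "Warning: $OUTPUT_FILE is only $SIZE bytes - probably not the real dataset."
  else
    echo "Download of $OUTPUT_FILE completed successfully ($SIZE bytes)."
  fi
else
  echo "Download of $OUTPUT_FILE failed (curl exit code $?)."
fi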