I recently launched a new site that needed to import/download legal public data files from around the web and repackage them as zip files, particularly when they came in another archive format. Here is the script I came up with after a lot of digging about:
#create urls variable array
declare -a urls=( $(cat "url.txt") )
#Download files from URL list
for m in "${urls[@]}"
do wget --load-cookies ~/cookies.txt "$m"
done
#Extract 7z and rar files to directories
7z x "*".7z -o"*"
7z x "*".rar -o"*"
#create directories list
ls | grep -v "\." > dir.txt
#create dirs variable array
declare -a dirs=( $(cat "dir.txt") )
#zip all directories
for t in "${dirs[@]}"
do zip -r "$t.zip" "$t"
done
#Cleanup everything else
rm -f *.7z
rm -f *.rar
rm -rf ./*/
rm -f dir.txt
mv *.zip /var/www/mywebsite.com/files/
So here is how the above works…
First, I manually create a text document of URLs. Basically it is just a document with a unique file download location on each line, and I place it in the same folder the script runs from. That is the url.txt document referenced above.
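For example, url.txt might look something like this (these URLs are made up, purely to show the one-download-per-line format):

https://example.com/data/report-2019.zip
https://example.org/archives/records.7z
https://example.net/downloads/statutes.rar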
Next, I declare a variable based on that text file. This is actually an array, as each line of the text file can be referenced individually by the rest of the code that follows. (I am no coder, so don’t ask too many questions…) The variable is called “urls”.
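If you want to double-check that the array was filled correctly before letting the rest of the script run, a couple of quick throwaway commands (not part of the script itself) will show the count and the contents:

echo "${#urls[@]}"          # how many URLs were read in
printf '%s\n' "${urls[@]}"  # print each URL on its own line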
Then, I start a loop that runs wget. The for “m” in blah blah blah statement works through the array so that each time this section of the script loops, it pulls the next URL from the next line in the text document. So “m” references an individual URL each time the loop runs. Then I run “wget” to download the file. Some of the sites I am pulling from require authentication data, so I load a cookie with an authenticated session into wget. That cookie file is in my home folder on the server, hence the “~/” file location.
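How you get that cookies.txt in the first place depends on the site. One option is to let wget log in once and save the session cookies itself; the sketch below assumes a form-based login, and the login URL and field names are made up, so you would need to swap in whatever your site actually uses:

# log in once and save the session cookies (URL and form fields are placeholders)
wget --save-cookies ~/cookies.txt --keep-session-cookies \
     --post-data 'username=myuser&password=mypass' \
     -O /dev/null https://example.com/login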
The loop will go through until all of the URLs in the text document have been hit by wget, and then it will be “done” and move on to the next part of the script.
The files I am getting come in different archive formats. Some are already zip files and I can leave those alone, but some are 7zip and others are rar. Those need to be unpacked and then repacked as zips. So the next two commands use 7zip (which can handle all three archive formats) to unpack each rar or 7zip archive into its own folder.
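The -o"*" part is what asks 7z to extract each archive into a folder named after the archive. If your version of 7z does not support that shorthand, an explicit loop does the same job and is arguably easier to read (the folder names here are simply the archive names with the extension stripped off):

shopt -s nullglob            # skip patterns that match nothing
for a in *.7z *.rar
do 7z x "$a" -o"${a%.*}"     # e.g. data.7z gets unpacked into a folder called data
done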
Then, I use the “ls” command (which can list the contents of a Linux directory) to list everything in the folder, filter out anything with a dot in its name (the downloaded files all have extensions, the unpacked folders do not), and output what is left to another text file called “dir.txt”. Hence, dir.txt becomes a list of all those directories I just created by unpacking rar and 7zip files.
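That grep trick only works because none of the unpacked folders have a dot in their name. If that ever changes, a more direct way to list only the sub-directories (this one uses GNU find, which is what most Linux distributions ship) would be:

# list only directories, one per line, without the leading ./
find . -mindepth 1 -maxdepth 1 -type d -printf '%f\n' > dir.txt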
Then we create another array called “dirs” based on that text file. I then proceed into another loop that goes through that list and, for each directory (one on each line), runs the zip command to pack that folder into its own zip file.
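As a side note, you could skip the dir.txt middle-man entirely by looping over the folders directly. This is just an alternative sketch, not what my script does:

# zip each sub-directory directly, no temporary dir.txt needed
for d in */
do zip -r "${d%/}.zip" "$d"
done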
Finally, I clean up by deleting the original 7zip and rar files I downloaded, all of the sub-directories, and the dir.txt file. Last but not least, I take all of the zip files in my directory and move them to another location for consumption.
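One small safety net worth considering (the destination path is mine, swap in your own): create the target folder first, so the move never fails and leaves the zips behind:

# create the destination folder if it doesn't already exist, then move the zips
mkdir -p /var/www/mywebsite.com/files/
mv *.zip /var/www/mywebsite.com/files/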
This is what I love about using the Linux command line. Anytime you find yourself doing a repetitive action in the CLI, it can most likely be turned into an automated script. I had to read quite a few different articles, forum posts, and wikis to cobble together my final solution. If you need something similar, hopefully this will help you save time! I am not a coder, so I will try to address any questions, but to be honest I don’t fully understand all of the syntax, particularly around creating the arrays and then using them in the loops.