Website copiers

All in all, I find wget better (more intuitive), even if it seems less configurable.

wget

  • wget can act as a powerful mirroring tool. Use it like this:
wget -p -k -H -nH -nd -E www.apple.com
  • The -p option downloads dependent files like images and CSS, -k activates link rewriting, -E activates file renaming (e.g. saving a .php file with an .html extension), and -H means you can download from hosts other than the original one.
  • -nH and -nd are minor options affecting how directories are created.
  • You can specify a link depth level (compared to HTTrack, it is one less, e.g. a link level of 1 will already fetch the linked URLs, which is much more logical).
  • If you need to disable robot exclusion (robots.txt and others), use the -e robots=off command line option.
  • The -q option makes wget silent. -T specifies a timeout and -t the number of retries (the default is 20). See the combined example below.
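  • For example, these options can be combined into a single invocation. This is only a sketch: the URL is a placeholder, and -r and -l (wget's standard recursion and depth flags, not covered above) are assumed for the recursive fetch:
wget -r -l 1 -p -k -E -H -nH -nd -e robots=off -q -T 30 -t 5 http://www.example.org/
  • In this sketch, -l 1 recurses one level from the start page, -T 30 sets a 30-second timeout, and -t 5 allows five retries per file.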

Warnings

  • Be careful: wget writes its output to stderr, not stdout.
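  • To capture or filter that output in a script, redirect stderr. A minimal sketch (the URL and log file name are placeholders):
wget -p -k http://www.example.org/ 2> wget.log
wget -p -k http://www.example.org/ 2>&1 | grep "saved"
  • The first command logs everything to wget.log; the second merges stderr into stdout so the messages can be piped through grep.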

Limitations

  • You cannot rewrite URLs to a hardcoded location.
  • You cannot give the downloaded main page a different name.
  • You cannot change the encoding of downloaded files.
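  • Since wget cannot do these rewrites itself, one workaround is to post-process the mirrored files. A sketch assuming GNU sed and iconv, with placeholder file names, URL and encodings:
mv index.html start.html                                    # rename the main page
sed -i 's|http://www.example.org/|/mirror/|g' *.html        # hardcode a URL prefix in links
iconv -f ISO-8859-1 -t UTF-8 start.html > start.utf8.html   # convert the encoding of one file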

HTTrack

  • This software seems useful but is quite complex and not very intuitive. Some important options (an example invocation follows this list):
    • -n: this will activate fetches for related elements like CSS files and images (same as -p on wget). However, it won't activate rewriting on those elements unless you use a higher link depth level, which makes it less powerful than wget.
    • -e: similar to -H on wget.
    • -r10: this specifies the link depth level (10 in this example). Note that it starts at 2, not 1.
  • You can change the original directory hierarchy structure with some options.
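  • For comparison, a basic HTTrack invocation might look like the sketch below. The URL and output directory are placeholders, and -O (which sets the directory the mirror is written to) is not one of the options discussed above:
httrack http://www.example.org/ -O ./mirror -r3 -n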