Website copiers

All in all, I find wget better (more intuitive), even if it seems less configurable.

wget

  • wget can act as a powerful mirroring tool. Use it like this:
wget -p -k -H -nH -nd -E www.apple.com
  • The -p option downloads dependent files like images or CSS, -k activates link rewriting, -E activates file renaming (for example, saving a .php file as .html), and -H allows downloading from hosts other than the original one.
  • -nH and -nd affect how directories are created: -nH skips the host-name directory prefix and -nd disables directory creation entirely, so all files end up in a single directory.
  • You can specify a link depth level with -l (compared to HTTrack, it is one less; e.g. a depth of 1 already fetches the linked URLs, which is much more logical).
  • If you need to disable robot exclusion (robots.txt and others), use the -e robots=off command line option.
  • The -q option makes wget silent. -T specifies a timeout and -t the number of retries (the default is 20). A combined example is shown below.
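  • For instance, the basic command above can be extended with these options; the depth of 2, the 10-second timeout and the 3 retries below are arbitrary values chosen for illustration:
wget -p -k -H -nH -nd -E -l 2 -e robots=off -q -T 10 -t 3 www.apple.com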

Warnings

  • Be aware that wget writes its output to stderr, not stdout.
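  • For example, to capture the progress messages in a log file (the file name here is just a placeholder), redirect stderr explicitly:
wget -p -k www.apple.com 2> wget.log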

Limitations

  • You cannot rewrite URLs to a hardcoded location.
  • You cannot rename the main page downloaded to another name.
  • You cannot change the encoding of downloaded files.
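  • These steps can instead be performed after the download. A possible post-processing sketch, assuming the mirror landed in the current directory and using placeholder file names, URLs and encodings (the in-place edit relies on GNU sed):
mv index.html main.html
sed -i 's|http://www.apple.com|http://mirror.example.com|g' main.html
iconv -f ISO-8859-1 -t UTF-8 main.html > main-utf8.html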

HTTrack

  • This software seems useful but is quite complex and not very intuitive. Some important options:
    • -n: this activates fetching of related elements like CSS files and images (same as -p on wget). However, it won't activate rewriting on those elements unless you use a higher link depth level, which makes it less powerful than wget.
    • -e: similar to -H on wget.
    • -r10: this specifies the link depth level (10 in this example). Note that it starts at 2, not 1.
  • You can change the original directory hierarchy structure with some options.
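  • Putting these options together, an invocation could look like the following (the URL, the depth of 3 and the output directory are placeholders; -O, which tells HTTrack where to write the mirror, is not described above):
httrack http://www.example.com -r3 -n -e -O ./mirror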