wget - the question for advanced users

Piotr
Posted: Wed Dec 27, 2006 2:32 pm    Post subject: wget - the question for advanced users
Archived from groups: alt>os>linux

Hi!

I'm trying to make a local copy of my homepage with photos:
www.pbase.com/piotrstankiewicz

More precisely, I'm trying to copy this site with all child pages
starting with www.pbase.com/piotrstankiewicz (which is not a problem), and
additionally I want to download all the photos used. Unfortunately, here I
have a problem, as the photos are hosted on other servers in the domain
pbase.com.

I tried something like this:

wget -r -H -p --convert-links --no-parent -l 3 --html-extension \
     -Dpbase.com --exclude-domains forum.pbase.com,search.pbase.com \
     http://www.pbase.com/piotrstankiewicz

Unfortunately, it doesn't work.

I don't know why wget tries to download directories above
www.pbase.com/piotrstankiewicz; it starts downloading, e.g.,
www.pbase.com/register.html (it looks like the --no-parent option doesn't
work, or I don't fully understand its behavior). Once it starts downloading
documents from the main folder, it goes on and on and wants to download
the contents of the whole server. :(

How can I make wget ignore all pages that don't start with
www.pbase.com/piotrstankiewicz (with the exception of photos hosted on other
servers)?

I tried the option -I /piotrstankiewicz,/piotrstankiewicz/image. In
that case all the web pages are downloaded OK (the way I expect),
but wget doesn't download any photos (e.g. it ignores this photo:
http://i5.pbase.com/o4/43/588543/1/60342441.SA0469_20na30_final.jpg ).

I also tried the option --exclude-directories /galleries,/help,/login
(without the -I option), as there are links from the
www.pbase.com/piotrstankiewicz site pointing to the structure above, but
wget ignores it and starts downloading the contents of the whole site.

Any ideas?


---------------------------------------------------

http://www.pbase.com/piotrstankiewicz
noi
Posted: Thu Dec 28, 2006 8:45 am    Post subject: Re: wget - the question for advanced users


Did you try the --mirror option?

Start over and use the --mirror option for your homepage. Enable a log file;
then maybe you can see what wget is doing.
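
A minimal sketch of the log-file idea, assuming wget's usual log layout in which each request is announced on a line that starts with `--` and ends with the URL. The file name `wget-mirror.log` and the helper `urls_outside_gallery` are made up for illustration. The wget call itself is left as a comment so nothing is downloaded by accident.

```shell
#!/bin/sh
# Hypothetical log file name for this sketch.
LOG_FILE="wget-mirror.log"

# Step 1 (not run here): mirror the start page and write messages to a log.
#   wget --mirror -o "$LOG_FILE" http://www.pbase.com/piotrstankiewicz

# Step 2: list the requested URLs that lie outside the gallery.
# Assumption: every request is logged on a line that starts with "--"
# and ends with the URL; adjust the awk pattern if your log differs.
urls_outside_gallery() {
    awk '/^--/ { print $NF }' | grep -v '/piotrstankiewicz' | sort -u
}

# Usage:
#   urls_outside_gallery < "$LOG_FILE"
```

Photos on hosts such as i5.pbase.com will show up in this list too, since their paths do not contain /piotrstankiewicz; the interesting entries are www.pbase.com pages such as login.html.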
Piotr
Posted: Fri Dec 29, 2006 12:28 am    Post subject: Re: wget - the question for advanced users

> Did you try the --mirror option?
>
> Start over and use the --mirror option for your homepage. Enable a log file;
> then maybe you can see what wget is doing.

It doesn't work as I would expect. Even if I use the -m option, wget starts
downloading documents from the main directory /, and I don't know why. :(

E.g. it downloads www.pbase.com/login.html

Even if I reject that page with the -R option, wget follows the links from the
login.html page. :(


root
Posted: Fri Dec 29, 2006 7:50 am    Post subject: Re: wget - the question for advanced users

Piotr <piotr.stankiewicz.foto RemoveThis @wp.pl> wrote:
>> Did you try the --mirror option?
>>
>> Start over and use the --mirror option for your homepage. Enable a log file;
>> then maybe you can see what wget is doing.
>
> It doesn't work as I would expect. Even if I use the -m option, wget starts
> downloading documents from the main directory /, and I don't know why. :(
>
> E.g. it downloads www.pbase.com/login.html
>
> Even if I reject that page with the -R option, wget follows the links from the
> login.html page. :(

There may be a link in one of the directories to the main page
of the site.
noi
Posted: Fri Dec 29, 2006 8:19 am    Post subject: Re: wget - the question for advanced users

On Fri, 29 Dec 2006 00:28:01 +0100, Piotr wrote this:

>> Did you try the --mirror option?
>>
>> Start over and use the --mirror option for your homepage. Enable a log file;
>> then maybe you can see what wget is doing.
>
> It doesn't work as I would expect. Even if I use the -m option, wget starts
> downloading documents from the main directory /, and I don't know why. :(
>
> E.g. it downloads www.pbase.com/login.html
>
> Even if I reject that page with the -R option, wget follows the links from the
> login.html page. :(

Hmm, sorry, I'm not following you too well.
You want to follow the links to subdirectory HTML pages and photos but exclude
parent directories? If you've run wget and have everything, then you
can delete what you don't need, right?

Following the links should download the pages linked from your HTML, even when
they point to parent directories, such as login.html, I think.

Anyway, I think you should try --mirror together with the exclude options
(--exclude-directories, --exclude-domains) to obtain the website clone that
you want.

Also, here is a link to the wget manual section on following links, if you
haven't read it yet:

http://www.gnu.org/software/wget/manual/wget.html#Following-Links
Piotr
Posted: Fri Dec 29, 2006 4:11 pm    Post subject: Re: wget - the question for advanced users

> There may be a link in one of the directories to the main page
> of the site.

Yes, such links exist, but I want to force wget not to follow them.

I used the -R option, but it doesn't work as I expected. The documents from the
main directory are deleted, but wget still follows the links in them. The
problem is that the documents in the main directory have so many links that I
can't and don't want to download them (I would download the contents of the
whole server, which means a terribly, terribly huge amount of data).

I tried the -I option. In that case wget downloads the HTML pages I
want, but it skips the photos hosted on other servers. :(

I also tried the -X option. In that case the photos are downloaded, but
depending on what I put after -X, wget either starts downloading the contents
of the whole server or downloads only one HTML document.

I have the impression that wget is not able to do what I want,
but I posted my problem here hoping that maybe someone will find a
nice trick.
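
One possible trick, sketched here under assumptions and not tested against pbase.com: split the job into two passes. The first pass fetches only the pages with the -I option that is reported above to work; the second pass pulls the photo URLs out of the saved pages and hands just those to wget. The helper `extract_image_urls`, the file `photos.txt` and the directory `photos` are made-up names, and the pattern assumes the photos sit on hosts named like i5.pbase.com and end in .jpg, as in the example URL from the first post.

```shell
#!/bin/sh
# Pass 1 (not run here): pages only, restricted to the gallery directories.
#   wget -r -l 3 --html-extension \
#        -I /piotrstankiewicz,/piotrstankiewicz/image \
#        http://www.pbase.com/piotrstankiewicz

# Pass 2: read HTML on stdin, print each distinct photo URL once.
# Assumption: photo hosts look like i<number>.pbase.com and files end in .jpg.
extract_image_urls() {
    grep -Eo "http://i[0-9]+\.pbase\.com/[^\"' <>]+\.jpg" | sort -u
}

# Usage (wget saves recursive downloads under a directory named after the host):
#   find www.pbase.com -name '*.html' -exec cat {} + | extract_image_urls > photos.txt
#   wget -i photos.txt -P photos
```

Because the second wget run is not recursive, it cannot wander into the rest of the server; the price is that --convert-links will not rewrite the pages to point at the local photo copies.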


Piotr
Posted: Fri Dec 29, 2006 4:11 pm    Post subject: Re: wget - the question for advanced users

> Hmm, sorry, I'm not following you too well.
> You want to follow the links to subdirectory HTML pages and photos but exclude
> parent directories? If you've run wget and have everything, then you
> can delete what you don't need, right?

No, I can't. If wget follows the links from the documents in the main
directory, it will download the contents of the whole server, which means
a terribly, terribly huge amount of data.

> Also, here is a link to the wget manual section on following links, if you
> haven't read it yet:
>
> http://www.gnu.org/software/wget/manual/wget.html#Following-Links

Yep, I've read the manual many times, but I still can't figure out how to
solve my problem.


noi
Posted: Fri Dec 29, 2006 4:51 pm    Post subject: Re: wget - the question for advanced users

On Fri, 29 Dec 2006 16:11:31 +0100, Piotr wrote this:

>> Hmm, sorry, I'm not following you too well. You want to follow the links
>> to subdirectory HTML pages and photos but exclude parent directories? If
>> you've run wget and have everything, then you can delete what you don't
>> need, right?
>
> No, I can't. If wget follows the links from the documents in the main
> directory, it will download the contents of the whole server, which means
> a terribly, terribly huge amount of data.
>
>> Also, here is a link to the wget manual section on following links, if you
>> haven't read it yet:
>>
>> http://www.gnu.org/software/wget/manual/wget.html#Following-Links
>
> Yep, I've read the manual many times, but I still can't figure out how to
> solve my problem.


Try the --no-parent option; otherwise you'll have to resolve the link problem
using multiple wget runs or a rewrite of your index.html.

-np
--no-parent
no_parent = on
The simplest, and often very useful way of limiting directories is disallowing retrieval of the links that refer to the hierarchy above than the beginning directory, i.e. disallowing ascent to the parent directory/directories.

The --no-parent option (short -np) is useful in this case. Using it guarantees that you will never leave the existing hierarchy. Supposing you issue Wget with:

wget -r --no-parent http://somehost/~luzer/my-archive/


You may rest assured that none of the references to
/~his-girls-homepage/ or /~luzer/all-my-mpegs/ will be followed. Only
the archive you are interested in will be downloaded. Essentially,
--no-parent is similar to -I/~luzer/my-archive, only it handles
redirections in a more intelligent fashion.
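
The same part of the wget manual also notes that, for HTTP, the trailing slash matters to --no-parent: in http://foo/bar/ wget treats bar as a directory, while in http://foo/bar it treats bar as a file name whose parent is /. That may be why --no-parent seemed to have no effect with the start URL used in this thread, although nothing here confirms it. The sketch below only mimics that rule to show which directory would be protected; `no_parent_dir` is a made-up helper and it never calls wget.

```shell
#!/bin/sh
# Print the directory that --no-parent would treat as the top of the
# hierarchy for a given start URL, following the trailing-slash rule
# described in the wget manual. Illustration only; wget is not invoked.
no_parent_dir() {
    rest="${1#*://}"                 # drop the scheme, e.g. "http://"
    case "$rest" in
        */*) path="/${rest#*/}" ;;   # keep everything after the host name
        *)   path="/" ;;             # bare host name: the path is the root
    esac
    printf '%s\n' "${path%/*}/"      # cut the last component, keep its directory
}

no_parent_dir "http://www.pbase.com/piotrstankiewicz"    # prints /
no_parent_dir "http://www.pbase.com/piotrstankiewicz/"   # prints /piotrstankiewicz/
```

If that is what happened, retrying the original command with http://www.pbase.com/piotrstankiewicz/ as the start URL may keep wget inside the gallery; whether pbase.com serves the gallery under that form is an assumption to check.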
Piotr
Posted: Sun Dec 31, 2006 12:03 am    Post subject: Re: wget - the question for advanced users

> Try the --no-parent option; otherwise you'll have to resolve the link problem
> using multiple wget runs or a rewrite of your index.html.
>
> -np
> --no-parent
> no_parent = on

I've tried it, and it looks like wget ignores that option for the
site I want to download. I don't know if it's a wget bug, some strange
trick in the construction of pbase.com, or a conflict between parameters. :(

