linux - Downloading all URLs accessible under a given domain with wget without saving the actual pages?
I am trying to determine all the valid URLs under a given domain without having to mirror the site locally.
People generally want to download the pages, but I just want a list of the direct URLs under the given domain (e.g. www.example.com), which would be something like:
www.example.com/page1
www.example.com/page2
etc.
Is there a way to use wget to do this? Or is there a better approach?
OK, I had to find my own answer:
The tool I used is httrack.
httrack -p0 -r2 -d www.example.com
- the -p0 option tells it to scan only (not save the pages);
- the -rx option sets the depth of the search (here -r2);
- the -d option tells it to stay on the same principal domain
There is also a -%l option to add the scanned URLs to a specified file, but it doesn't seem to work. That's not a problem, though, because under the hts-cache directory you can find a TSV file named new.txt containing all the URLs visited, along with some additional information about them. I could extract the URLs with the following Python code:
import csv

# hts-cache/new.txt is tab-separated; the 'url' column holds each visited URL
with open("hts-cache/new.txt") as f:
    t = csv.DictReader(f, delimiter='\t')
    for l in t:
        print(l['url'])
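As for the wget side of the question: a rough sketch, assuming GNU wget and that parsing its log output is acceptable, is to run it in recursive spider mode and pull the URLs out of the log. The depth, the -np flag, and the awk field below are assumptions you may need to adjust for your site:

# Crawl recursively to depth 2 in spider mode (pages are fetched to
# discover links but not kept on disk), then extract the URLs that
# wget reports in its log output.
wget --spider -r -l 2 -np http://www.example.com 2>&1 |
  grep '^--' |
  awk '{ print $3 }' |
  sort -u

Note that in recursive spider mode wget still has to fetch the HTML pages in order to discover links; it just does not keep them, which is why I ended up preferring httrack's cache file.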