linux - Downloading all URLs accessible under a given domain with wget without saving the actual pages?
I am trying to determine all the valid URLs under a given domain without having to mirror the site locally.
People generally want to download the pages, but I just want a list of the direct URLs under the given domain (e.g. www.example.com), which would be something like:
www.example.com/page1
www.example.com/page2
etc.
Is there a way to use wget to do this? Or is there a better approach?
OK, I had to find my own answer:
The tool I used is httrack.
httrack -p0 -r2 -d www.example.com
- the -p0 option tells it to scan only (not save the pages);
- the -rx option sets the depth of the search (here -r2);
- the -d option tells it to stay on the same principal domain
There is also a -%l option to add the scanned URLs to a specified file, but it doesn't seem to work. That's not a problem, though, because under the hts-cache directory you can find a TSV file named new.txt containing all the URLs visited, along with some additional information about them. I could extract the URLs with the following Python code:
import csv

# hts-cache/new.txt is tab-separated; the 'url' column holds each visited URL
with open("hts-cache/new.txt") as f:
    t = csv.DictReader(f, delimiter='\t')
    for l in t:
        print(l['url'])
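As for the wget side of the question: a rough sketch, assuming GNU wget and that parsing its log output is acceptable, is to run it in recursive spider mode and pull the URLs out of the log. The depth, the -np flag, and the awk field below are assumptions you may need to adjust for your site:

# Crawl recursively to depth 2 in spider mode (pages are fetched to
# discover links but not kept on disk), then extract the URLs that
# wget reports in its log output.
wget --spider -r -l 2 -np http://www.example.com 2>&1 |
  grep '^--' |
  awk '{ print $3 }' |
  sort -u

Note that in recursive spider mode wget still has to fetch the HTML pages in order to discover links; it just does not keep them, which is why I ended up preferring httrack's cache file.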