linux - Downloading all URLs accessible under a given domain with wget without saving the actual pages?


I am trying to determine the valid URLs under a given domain without having to mirror the site locally.

I don't want to download the pages; I just want a list of the direct URLs under a given domain (e.g. www.example.com), such as:

  • www.example.com/page1
  • www.example.com/page2
  • etc.

Is there a way to use wget for this? Or is there a better approach?

OK, I had to find my own answer:

The tool I used is httrack.

httrack -p0 -r2 -d www.example.com 
  • the -p0 option tells it to just scan (not save the pages);
  • the -rx option sets the depth of the search;
  • the -d option tells it to stay on the same principal domain (a scriptable version of this command is sketched after this list).
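
If you want to drive the scan from a script instead of typing the command by hand, here is a minimal sketch using Python's subprocess module. It assumes httrack is installed and on the PATH, and www.example.com is just a placeholder domain.

import subprocess

# Run the same scan as above: -p0 scan only, -r2 depth 2,
# -d stay on the same principal domain. Adjust the domain as needed.
subprocess.run(
    ["httrack", "-p0", "-r2", "-d", "www.example.com"],
    check=True,
)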

There is also a -%l option that is supposed to add the scanned URLs to a specified file, but it doesn't seem to work. That's not a problem, though, because under the hts-cache directory you can find a TSV file named new.txt containing all the URLs that were visited, along with some additional information about them. I can extract the URLs with the following Python code:

with open("hts-cache/new.txt") f:     t = csv.dictreader(f,delimiter='\t')     l in t:         print l['url'] 
