Python - Beautiful Soup throws `IndexError`


I am scraping a website using Python 2.7 and Beautiful Soup 3.2. I am new to both, but the documentation got me started.

I have been reading the following documentation: http://www.crummy.com/software/beautifulsoup/bs3/documentation.html#contents and http://thepcspy.com/read/scraping-websites-with-python/

This is what I have so far (the last part fails):

    # Import the classes that are needed
    import urllib2
    from BeautifulSoup import BeautifulSoup

    # URL to scrape, opened with urllib2
    url = 'http://www.wiziwig.tv/competition.php?competitionid=92&part=sports&discipline=football'
    source = urllib2.urlopen(url)

    # Turn the saved source into a BeautifulSoup object
    soup = BeautifulSoup(source)

    # From the source HTML page, find and store every <td class="home">..</td> and its content
    hometeamstd = soup.findAll('td', { "class" : "home" })
    # Loop through the tags and store the needed information, being the home team
    hometeams = [tag.contents[1] for tag in hometeamstd]

    # From the source HTML page, find and store every <td class="away">..</td> and its content
    awayteamstd = soup.findAll('td', { "class" : "away" })
    # Loop through the tags and store the needed information, being the away team
    awayteams = [tag.contents[1] for tag in awayteamstd]

The content of tag.contents for each tag in hometeamstd looks like this:

    [
        [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'harkemase boys', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6077" />],
        [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'rkc waalwijk', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-427" />],
        [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'dutch knvb beker', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758" />],
        [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'psv', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-3" />],
        [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'ajax', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-2" />],
        [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'dutch knvb beker', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758" />],
        [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'sc heerenveen', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-14" />],
        [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'feyenoord', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-9" />],
        [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'dutch knvb beker', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758" />]
    ]

The content of tag.contents for each tag in awayteamstd looks like this:

    [
        [u'away-team'],
        [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-13" />, u'nec', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
        [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-11" />, u'heracles', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
        [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-428" />, u'stormvogels telstar', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
        [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-419" />, u'fc volendam', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
        [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-7" />, u'fc twente', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
        [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-415" />, u'fc dordrecht', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />]
    ]

The problems I am trying to solve, but don't quite understand yet, are:

  • The code awayteams = [tag.contents[1] for tag in awayteamstd] throws an error: IndexError: list index out of range. Of course that is correct, because as you can see in the output of tag.contents for awayteamstd, the first entry is [u'away-team'], which has only one element. That is why it is failing. How can I remove or skip that one?
  • The hometeams output is working, but I want to exclude the entries where the text dutch knvb beker occurs.

The problem is that the "away" header cell (the column name) is also a td with the "away" class:

    <thead class="title">
        ...
        <tr class="sub">
            ...
            <td>home-team</td>
            <td></td>
            <td class="away">away-team</td>
            <td class="broadcast">broadcast</td>
        </tr>
    </thead>

Just skip it using slicing:

    awayteamstd = soup.findAll('td', { "class" : "away" })[1:]
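The slice works because the header cell is always the first match. Here is a minimal sketch of the same idea that runs without the network, assuming plain Python lists can stand in for each tag.contents. The sample data and the name `away_contents` are illustrative, not taken from the live page. It also shows a length check as an alternative that does not depend on where the header cell appears:

```python
# Stand-ins for tag.contents of each <td class="away"> cell (illustrative data).
# Plain strings replace the <img> tags; only the list shapes matter here.
away_contents = [
    [u'away-team'],                                # header cell: a single child
    [u'<img fav>', u'nec', u'<img flag>'],
    [u'<img fav>', u'heracles', u'<img flag>'],
]

# Option 1: drop the first entry by slicing, as in the answer.
by_slice = [contents[1] for contents in away_contents[1:]]

# Option 2: keep only the cells that actually have a second child.
by_length = [contents[1] for contents in away_contents if len(contents) > 1]

print(by_slice)
print(by_length)
```

Both options give the same list for this data. The length check may be the safer choice if the page could contain more than one header row.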

Also, if you want to exclude dutch knvb beker from the list of home teams, add a condition to the list comprehension expression:

    hometeams = [tag.contents[1] for tag in hometeamstd if tag.contents[1] != 'dutch knvb beker']
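The filtering step can be tried out the same way. This is only a sketch, assuming the team name is always the second child of the cell. The names `home_contents` and `excluded` are hypothetical, and the `strip()` call is an added precaution against surrounding whitespace in scraped text, not something the original page is known to need:

```python
# Stand-ins for tag.contents of each <td class="home"> cell (illustrative data).
# The padded u' psv ' entry is made up to show what strip() does.
home_contents = [
    [u'<img flag>', u'harkemase boys', u'<img fav>'],
    [u'<img flag>', u'rkc waalwijk', u'<img fav>'],
    [u'<img flag>', u'dutch knvb beker', u'<img fav>'],
    [u'<img flag>', u' psv ', u'<img fav>'],
]

# Names to leave out; a set keeps the check simple if more are added later.
excluded = set([u'dutch knvb beker'])

hometeams = [
    contents[1].strip()
    for contents in home_contents
    if contents[1].strip() not in excluded
]

print(hometeams)
```

Using a set of excluded names instead of a single `!=` comparison is a design choice, not a requirement. With one name to exclude, the comparison in the answer does the same job.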
