python - Beautiful Soup throws `IndexError`
I am scraping a website using Python 2.7 and Beautiful Soup 3.2. I am new to both, but with the documentation I got a bit started.
I have been reading the following documentation: http://www.crummy.com/software/beautifulsoup/bs3/documentation.html#contents http://thepcspy.com/read/scraping-websites-with-python/
This is what I have so far (including the part that fails):
# Import the classes that are needed
import urllib2
from BeautifulSoup import BeautifulSoup

# URL to scrape, opened with urllib2
url = 'http://www.wiziwig.tv/competition.php?competitionid=92&part=sports&discipline=football'
source = urllib2.urlopen(url)

# Turn the saved source into a BeautifulSoup object
soup = BeautifulSoup(source)

# From the source HTML page, search for and store every <td class="home">..</td> and its content
hometeamstd = soup.findAll('td', { "class" : "home" })
# Loop through the tags and store only the needed information, being the home team
hometeams = [tag.contents[1] for tag in hometeamstd]

# From the source HTML page, search for and store every <td class="away">..</td> and its content
awayteamstd = soup.findAll('td', { "class" : "away" })
# Loop through the tags and store only the needed information, being the away team
awayteams = [tag.contents[1] for tag in awayteamstd]
The content of tag.contents for each tag in hometeamstd looks like this:
[
  [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'harkemase boys', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6077" />],
  [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'rkc waalwijk', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-427" />],
  [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'dutch knvb beker', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758" />],
  [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'psv', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-3" />],
  [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'ajax', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-2" />],
  [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'dutch knvb beker', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758" />],
  [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'sc heerenveen', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-14" />],
  [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'feyenoord', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-9" />],
  [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'dutch knvb beker', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758" />]
]
The content of tag.contents for each tag in awayteamstd looks like this:
[
  [u'away-team'],
  [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-13" />, u'nec', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
  [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-11" />, u'heracles', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
  [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-428" />, u'stormvogels telstar', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
  [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-419" />, u'fc volendam', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
  [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-7" />, u'fc twente', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
  [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-415" />, u'fc dordrecht', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />]
]
The problems I am trying to solve, but can't quite figure out yet, are:
- The code
awayteams = [tag.contents[1] for tag in awayteamstd]
throws an error: IndexError: list index out of range. Of course this is correct, because as you can see in the output of tag.contents for awayteamstd, the first entry is [u'away-team'], which has only one element. That is why it is failing. How can I remove or skip that one?
- Within the hometeams output, which is working, how can I exclude the entries where the text dutch knvb beker occurs?
The problem is that the header row also contains an "away" cell (the column name), which is a td with the "away" class, so findAll matches it too:
<thead class="title">
  ...
  <tr class="sub">
    ...
    <td>home-team</td>
    <td></td>
    <td class="away">away-team</td>
    <td class="broadcast">broadcast</td>
  </tr>
</thead>
Just skip it using slicing:
awayteamstd = soup.findAll('td', { "class" : "away" })[1:]
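Slicing with [1:] assumes the header cell is always the first match. As a rough alternative sketch, the check can be made on the length of each contents list instead, so any cell without an element at index 1 is skipped wherever it appears. The plain lists below are stand-ins for each tag.contents, and the helper name extract_team_names is hypothetical, not part of Beautiful Soup:

```python
def extract_team_names(contents_lists):
    """Return the item at index 1 of every list that is long enough to have one."""
    return [contents[1] for contents in contents_lists if len(contents) > 1]


# Illustrative stand-ins for tag.contents of each <td class="away"> cell.
away_contents = [
    [u'away-team'],                            # header cell: a single child, no index 1
    ['<img fav>', u'nec', '<img flag>'],       # regular cell: image, name, image
    ['<img fav>', u'heracles', '<img flag>'],
]

awayteams = extract_team_names(away_contents)
print(awayteams)
```

With real tags, the same condition could be written directly in the comprehension as `if len(tag.contents) > 1`.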
Also, if you want to exclude dutch knvb beker from the list of home teams, add a condition to the list comprehension expression:
hometeams = [tag.contents[1] for tag in hometeamstd if tag.contents[1] != 'dutch knvb beker']
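The filtering condition can be tried without fetching the page. The following is a minimal sketch that uses plain lists in place of tag.contents; the sample data, the EXCLUDED set, and the strip() call are assumptions added here (to tolerate stray whitespace around a name), not something the original code does:

```python
# Names to leave out of the result; a set makes adding more exclusions easy.
EXCLUDED = {u'dutch knvb beker'}

# Illustrative stand-ins for tag.contents of each <td class="home"> cell.
home_contents = [
    ['<img flag>', u'harkemase boys', '<img fav>'],
    ['<img flag>', u'dutch knvb beker', '<img fav>'],
    ['<img flag>', u' psv ', '<img fav>'],     # padded on purpose to show strip()
]

hometeams = [
    contents[1].strip()
    for contents in home_contents
    if contents[1].strip() not in EXCLUDED
]
print(hometeams)
```

Note that the comparison is exact and case-sensitive, so the excluded text has to match the page's text character for character.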