Crawl rubenvarela.com
- Get all results, even the okay ones (--verbose). By default, linkchecker only returns URLs that produced errors or warnings, so you can omit the --verbose argument if you want.
- Store the results in XML format, in linkchecker-out.xml.
In [141]:
!linkchecker -F xml http://rubenvarela.com --check-extern --verbose
In [142]:
import xmltodict
from random import sample as randSample
In [143]:
with open('linkchecker-out.xml') as f:
obj = xmltodict.parse(f.read())
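To get a feel for what xmltodict returns, here is a small sketch using hypothetical sample data shaped like linkchecker's output: element attributes become keys prefixed with `@`, text content lands under `#text`, and nested elements become nested dicts.

```python
import xmltodict

# Hypothetical XML snippet, shaped like linkchecker's output.
sample_xml = """
<linkchecker created="2014-01-01">
  <urldata>
    <url>http://example.com/</url>
    <valid result="200 OK"/>
  </urldata>
</linkchecker>
"""

doc = xmltodict.parse(sample_xml)
# Attributes become '@'-prefixed keys; child elements become nested dicts.
print(doc['linkchecker']['@created'])                     # 2014-01-01
print(doc['linkchecker']['urldata']['url'])               # http://example.com/
print(doc['linkchecker']['urldata']['valid']['@result'])  # 200 OK
```

Note that an element that appears once parses to a single dict, while a repeated element (like the many `urldata` entries in a real crawl) parses to a list of dicts.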
- Explore the object.
- Get the main element and see its contents.
- Get 5 random elements and print them to see what the objects look like.
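`random.sample` picks k distinct elements without replacement, which is why it works for grabbing a handful of entries to inspect. A quick sketch with a stand-in list (hypothetical data, not the real crawl results):

```python
from random import sample as randSample

# Stand-in for obj['linkchecker']['urldata'] (hypothetical URLs).
urls = ['http://example.com/page%d' % i for i in range(10)]

picked = randSample(urls, 5)
# 5 distinct entries from the list, in random order.
assert len(picked) == 5
assert len(set(picked)) == 5
```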
In [144]:
for x in obj['linkchecker']:
    print(x)

# print(type(obj['linkchecker']))
# print(obj['linkchecker']['@created'])
# print(type(obj['linkchecker']['urldata']))

randset = randSample(obj['linkchecker']['urldata'], 5)
randset

# print(obj['linkchecker']['urldata'][0])
# print(obj['linkchecker']['urldata'][1])
# print(obj['linkchecker']['urldata'][2])
# print(obj['linkchecker']['urldata'][3])
# print(obj['linkchecker']['urldata'][99])
# print(obj['linkchecker']['urldata'][100])
# print(obj['linkchecker']['urldata'][101])
# print('')
# print(obj['linkchecker']['urldata'][100]['parent']['#text'])
Out[144]:
Now to extract links with errors:
- For each URL,
- If the result is not okay,
- Check if the valid and parent keys exist.
- If they do, print the values.
- If the necessary keys don't exist, print the object.
In [145]:
for x in obj['linkchecker']['urldata']:
    # print(x['valid']['@result'])
    if x['valid']['@result'] != "200 OK":
        if 'valid' in x and 'parent' in x:  # check that both keys exist in x
            print("Result Code: " + x['valid']['@result'])
            print("Link location: " + x['parent']['#text'])
            print("Line: " + x['parent']['@line'])
        else:
            print("Couldn't find all elements:")
            print(x)
        print("Links to: " + x['url'])
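A more defensive variant of the same loop, sketched against hypothetical entries with the same key layout: `dict.get` avoids a KeyError when a key is missing, so the loop never dies mid-crawl on an oddly shaped entry.

```python
# Hypothetical urldata entries mimicking xmltodict's parse of linkchecker output.
urldata = [
    {'url': 'http://example.com/ok',
     'valid': {'@result': '200 OK'}},
    {'url': 'http://example.com/broken',
     'valid': {'@result': '404 Not Found'},
     'parent': {'#text': 'http://example.com/', '@line': '12'}},
    {'url': 'http://example.com/odd',
     'valid': {'@result': '500 Internal Server Error'}},  # no 'parent' key
]

for x in urldata:
    # .get() with a default returns '' instead of raising KeyError.
    result = x.get('valid', {}).get('@result', '')
    if result != "200 OK":
        parent = x.get('parent')
        if parent:
            print("Result Code: " + result)
            print("Link location: " + parent['#text'])
            print("Line: " + parent['@line'])
        else:
            print("Couldn't find all elements:")
            print(x)
        print("Links to: " + x['url'])
```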