所用で英文テキストが欲しかったので、lxmlで英文テキストを取ってくるスクリプトを作成。汎用性はまったく無いけど、とりあえず公開。
#!/usr/bin/env python
#! -*- coding: utf-8 -*-
import sys
import urllib2
import lxml.html
def get_texts(url, tag):
try:
html = urllib2.urlopen(url).read()
except urllib2.HTTPError, e:
# If returned 404
#print e.read()
print "Can't access to the given URL."
return
root = lxml.html.fromstring(html)
anchors = root.xpath(tag)
text_list = []
for a in anchors:
text = lxml.html.tostring(a, method='text', encoding='utf-8')
#print text.strip('\t\n')
text_list.append(text.strip('\t\n'))
if not text_list:
text = "There are no tags in this page"
text_list.append(text)
return text_list
def create_xpath(tag, attr):
xpath = "//"
xpath += tag
if not attr == "":
xpath += "[@"
attr,value = attr.split("=")
xpath += attr
xpath += "=\"" + value + "\"]"
print "Created Xpath: " + xpath
return xpath
if __name__ == "__main__":
argv_len = len(sys.argv)
if not (argv_len == 3 or argv_len == 4):
print "Usage: python tag-getter.py URL TAG [ATTR]"
exit()
url = sys.argv[1].lower()
tag = sys.argv[2].lower()
if argv_len == 4:
attr = sys.argv[3].lower()
else:
attr = ""
xpath = create_xpath(tag, attr)
text_list = get_texts(url, xpath)
for t in text_list:
print t
Usage Example:
$ python tag-getter.py http://ebooks.adelaide.edu.au/c/carroll/lewis/alice/chapter1.html div class=dochead
Created Xpath: //div[@class="dochead"]
Alice in Wonderland, by Lewis Carroll
$ python tag-getter.py http://ebooks.adelaide.edu.au/c/carroll/lewis/alice/chapter1.html p
Created Xpath: //p
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice
she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of
a book,’ thought Alice ‘without pictures or conversation?’
... 後略 ...

0 件のコメント:
コメントを投稿