safarisync
view safarisync/safarisync.py @ 14:811bb2e2ed2f
Fix the logging type on a debug message. Disable the use of urlretrieve because of a problem on python 2.6 and windows. Turn on binary file mode, since it's needed on windows.
| author | Douglas Mayle http://douglas.mayle.org |
|---|---|
| date | Fri Feb 27 21:27:49 2009 +0000 |
| parents | 17f2ae3f60d8 |
| children | |
line source
1 #!/usr/bin/env python
3 # HTML text to DOM library
6 # Net and url based tools
10 # To cleanup our book titles so that they can be used as filenames
13 # Tools for working with files and directories
16 # Module that allows us to prompt for a password without echoing
19 # Standard logging module
22 # A regex for selecting out the characters that are invalid and replacing them.
23 # TODO Check out substituting Unicode equivalent characters instead...
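
The regex itself is not shown in this view, so the following is only a minimal sketch of the idea in Python 3 (the script itself targets Python 2). The pattern, the name `INVALID_FILENAME_CHARS`, and the helper `sanitize_filename` are assumptions for illustration, not the script's actual definitions.

```python
import re

# Assumed pattern: characters that Windows and POSIX filesystems commonly
# reject in file names, plus ASCII control characters.
INVALID_FILENAME_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_filename(title, replacement="_"):
    """Replace every invalid character in title with replacement."""
    return INVALID_FILENAME_CHARS.sub(replacement, title).strip()


print(sanitize_filename('C++: A "Primer"?'))
```

A single replacement character keeps the result predictable; swapping in Unicode look-alikes, as the TODO suggests, would only change the substitution step.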
26 # A mapping between logging strings and logging levels.
33 # The default logging level for this program
36 # The list of input values necessary to request pdf generation
49 """Monkey patch the standard library modules to keep session cookies."""
50 # If you need to handle cookies in python, you have to monkey patch the
51 # libraries used to fetch files. The most common libraries used for this
52 # purpose are urllib and urllib2 (thankfully, they're consolidated in
53 # Python 3, but we're not there yet...)
56 # If you are having problems with your cookies, it will be useful to set up
57 # an LWP cookie jar, which allows us to inspect cookies in a human-readable
58 # format. Use the following code instead.
59 #
60 # from cookielib import LWPCookieJar
61 # global cj
62 # cj = LWPCookieJar()
63 #
64 # opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
69 # We depend on urllib2 to perform our patching, so we know it's here
74 # If the program has urllib loaded, we'll patch that, as well.
79 # Ideally, we'd also like to patch lxml so that we can use its built-in
80 # facilities with cookies, as well, but lxml sometimes uses urllib and
81 # sometimes uses libxml's web facilities, which we can't patch. You have
82 # to be aware of this and work around the limitations.
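
For readers on Python 3, where `urllib` and `urllib2` are consolidated into `urllib.request`, the cookie-keeping setup described in these comments might look roughly like the sketch below. The helper name `build_cookie_opener` is hypothetical, and installing the opener globally is assumed to stand in for the monkey patching done here.

```python
import urllib.request
from http.cookiejar import LWPCookieJar


def build_cookie_opener(cookie_file=None):
    """Return an opener that stores session cookies, plus its cookie jar.

    LWPCookieJar can be saved to disk in a human-readable format, which
    helps when debugging cookie problems.
    """
    jar = LWPCookieJar(cookie_file)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = build_cookie_opener()
# After this call, every urllib.request.urlopen() shares the cookie jar.
urllib.request.install_opener(opener)
```

As the comments note for lxml, code that fetches URLs through libxml's own networking will not go through this opener.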
85 """Log in to the Safari website to load a session cookie."""
99 # For the purpose of this script, we assume success, so we don't care about
100 # the result. We really should verify this, though.
104 # lxml uses urllib by default for downloads. Since we've patched it for
105 # cookie support, this is sufficient for our needs.
109 """Connect to Safari to retrieve the data, and then download any files not
110 on the local disk. Request any unavailable PDFs if necessary."""
111 # Read and parse downloads
115 # Get the list of table headers
116 headers = [header.text_content().strip().lower() for header in doc.cssselect("table.Content th")]
118 # In order to be a bit more resilient to changes in the document, we'll try
119 # to find the information we care about in the table.
125 # We store a one based index for css selection
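
The lookup code is elided in this view. A minimal Python 3 sketch of the approach these comments describe follows; the helper names `find_columns` and `cell_selector`, and the use of `td:nth-child(n)` as the selector, are assumptions rather than the script's actual code.

```python
def find_columns(headers, wanted):
    """Map each wanted header name to its one-based column index.

    Returns the mapping plus a list of the names that were not found.
    """
    columns = {}
    missing = []
    for name in wanted:
        if name in headers:
            # CSS :nth-child() counts from 1, hence the +1.
            columns[name] = headers.index(name) + 1
        else:
            missing.append(name)
    return columns, missing


def cell_selector(column):
    """Build a CSS selector for the cell in the given one-based column."""
    return "td:nth-child(%d)" % column


headers = ["title", "section", "download"]
columns, missing = find_columns(headers, ["title", "download", "format"])
print(columns, missing, cell_selector(columns["download"]))
```

Looking columns up by header text rather than fixed position is what lets the scraper survive reordered or added columns.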
129 logging.critical("Unable to find download metadata for these categories: '%s' from headers:\n%s" % \
134 ###################################################
135 # Some helper functions for extracting cell content
136 ###################################################
138 "Get the href of the first a node, or return an empty string"
140 # This dance cleans up some technically valid, but useless links like '#'
145 "Return the text content of the cell, strip leading and trailing whitespace."
148 # Because of a bug either in lxml, or the libraries it depends on
149 # (libxml, libxslt), it reads the utf-8 document and treats it as
150 # if it were latin-1. We fix the encoding mistake.
154 # This probably means that the text was properly decoded and it
155 # contains characters not valid in the latin-1 set. We'll let this
156 # pass.
159 # Just in case this bug exists only on my system, we'll ignore it
160 # if the 'fix' doesn't work.
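
The repair these comments describe can be sketched in Python 3 as a round trip through Latin-1. The function name `fix_misdecoded_utf8` is hypothetical, and the script's actual handling (written for Python 2) is not shown in this view.

```python
def fix_misdecoded_utf8(text):
    """Undo a UTF-8 document having been decoded as Latin-1."""
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except UnicodeEncodeError:
        # The text has characters outside Latin-1, so it was probably
        # decoded correctly in the first place.
        return text
    except UnicodeDecodeError:
        # The bytes are not valid UTF-8, so the 'fix' does not apply.
        return text


print(fix_misdecoded_utf8("caf\u00c3\u00a9"))
```

Both failure cases return the input unchanged, which matches the intent of ignoring the fix whenever it does not work.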
163 # We'll keep track of whether or not we requested pdfs so that we can print
164 # a helpful error message
167 # Extract a list of table row elements, each one containing data about one
168 # download
177 continue
179 # Safari book downloads don't have a section text.
181 progress_message = "Handling Section '%s' of Book '%s'" % (get_text(sectioncell), get_text(titlecell))
203 "requested from Safari, so please rerun this after generation is " \
207 """Download the file from the given link, and save it to the specified filepath"""
212 # The error is raised even if the directory already exists. If so, we
213 # ignore the error.
216 return
218 # urllib.urlretrieve
219 #urllib.urlretrieve(link, filepath)
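
With `urlretrieve` disabled, the download has to copy the response to disk by hand. The sketch below (Python 3) shows one way to do that, assuming a file-like response object; `ensure_directory` and `save_stream` are hypothetical names, not functions from this script.

```python
import errno
import os
import shutil


def ensure_directory(path):
    """Create path, ignoring the error raised if it already exists."""
    try:
        os.makedirs(path)
    except OSError as error:
        if error.errno != errno.EEXIST:
            raise


def save_stream(source, filepath):
    """Copy a binary file-like object to filepath."""
    ensure_directory(os.path.dirname(filepath) or ".")
    # Binary mode matters on Windows, where text mode rewrites line
    # endings and would corrupt a PDF.
    with open(filepath, "wb") as target:
        shutil.copyfileobj(source, target)
```

In use, `source` would be the object returned by the cookie-aware opener, for example `save_stream(opener.open(link), filepath)`.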
225 """Turn a list of path elements into a path, while sanitizing the characters"""
229 """Submit a PDF generation request. This is now an AJAX only interface, so
230 we hack it instead of connecting to a web page to fill out the form."""
241 "Request a user and password, taking into account data from the command line."
267 help='Change the logging level of this application. Possible choices are "%s".' % ', '.join(LOGLEVELS.keys()))
268 # This normally won't be used unless someone is debugging the html
269 # scraping. In that case, it saves the effort of supplying the user and
270 # password and connecting to the server.
278 # We only allow a set list of log levels. If the one supplied is bogus,
279 # use the default, but notify the users in case that means we've munged
280 # some other parameter.
282 logging.error("Invalid log level '%s', defaulting to '%s'" % (options.loglevel, DEFAULT_LOGGING))
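
The mapping and fallback described above might be sketched as follows in Python 3. `LOGLEVELS` and `DEFAULT_LOGGING` are names from the script, but their contents here are assumptions, and `resolve_loglevel` is a hypothetical helper.

```python
import logging

# Assumed contents; the actual mapping is not shown in this view.
LOGLEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
    "critical": logging.CRITICAL,
}

# Assumed default; the script's real value is not shown.
DEFAULT_LOGGING = "info"


def resolve_loglevel(name):
    """Return the logging level for name, falling back to the default."""
    if name not in LOGLEVELS:
        logging.error("Invalid log level '%s', defaulting to '%s'",
                      name, DEFAULT_LOGGING)
        name = DEFAULT_LOGGING
    return LOGLEVELS[name]


logging.basicConfig(level=resolve_loglevel("debug"))
```

Reporting the bad value instead of silently substituting the default makes a mistyped command line easier to spot.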
295 # Set up urllib2 to keep cookies.
