Downloading files with multiple extensions on a webpage using mechanize library in Python


My first question on Stack Overflow!

I'm trying to download resumes from a job posting website. I've found the link that leads to the download, but the download URLs have a '.php' ending, and hence I don't know the extension of the file that is going to be downloaded (.doc, .docx, .pdf). The relevant last section of the link looks like this: ("~/resumedownload.php?f=wfeilbbzwg==")

I've used mechanize to log in to the website, and then to download the file:

filename = br.retrieve(link.get('href'), os.path.expanduser("~/desktop/job postings/hirist/" + str(i) + ".pdf"))[0] 

This brings in the .pdf files fine, but corrupts the rest. The filename variable is a .php file.

Any suggestions?

Browser.retrieve() returns a tuple consisting of the filename that the file was written to and the headers from the remote server. You can use the Content-Type header to determine the MIME type of the file, and then the mimetypes module to get an appropriate extension for that type. Finally, rename the file.

    import mechanize
    import shutil
    import os.path
    import mimetypes

    #url = 'http://stackoverflow.com'
    url = 'http://heriverde.nimoz.pl/wp-content/uploads/pdf-sample.pdf'
    br = mechanize.Browser()
    # With no filename argument, retrieve() saves to a temporary file
    filename, headers = br.retrieve(url)

    dest_dir = os.path.expanduser('~/desktop/job postings/hirist/')
    # Content-Type may include an encoding, e.g. text/html; charset=utf-8
    content_type = headers.get('content-type', '').split(';')[0]
    extension = mimetypes.guess_extension(content_type)
    if not extension:
        extension = '.dunno'

    # `i` is assumed to be a counter defined by the surrounding loop
    dest_filename = '{}{}'.format(i, extension)
    shutil.move(filename, os.path.join(dest_dir, dest_filename))
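The header-to-extension step can also be pulled out into a small helper so it can be checked without touching the network. This is only a sketch: the function name `extension_for` and the `.bin` fallback are made up for illustration, and the exact extension returned for some MIME types can vary with the Python version and the system's MIME database.

```python
import mimetypes


def extension_for(content_type, default=".bin"):
    """Map a raw Content-Type header value to a file extension.

    `extension_for` and `default` are hypothetical names, not part of
    mechanize or of the code above.
    """
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    mime_type = (content_type or "").split(";")[0].strip().lower()
    # guess_extension() returns None for types it does not know.
    return mimetypes.guess_extension(mime_type) or default


if __name__ == "__main__":
    for header in ("application/pdf", "application/msword",
                   "text/html; charset=utf-8", ""):
        print(repr(header), "->", extension_for(header))
```

In the snippet above, `extension_for(headers.get('content-type'))` could stand in for the `split`/`guess_extension`/`if not extension` lines. Note that a server may send a generic type such as `application/octet-stream` for every download, in which case the Content-Type header alone will not tell a .doc from a .pdf.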
