Downloading files using Selenium and Apache HttpClient

The Selenium framework has become the standard for web browser automation, but at the time of writing this article, Selenium WebDriver doesn’t include a built-in method for downloading files from the web.

Problem Description

Consider the following scenario:
You’re writing an automation script that’s supposed to perform a login to a website (using a username and password), and then you’re transferred to the next web page which has a link for downloading a PDF document (call it: my_report.pdf). To download the file, your script would need to be able to perform the following steps:

  1. Enter a username and password into the relevant input fields and click the “Submit” button.
  2. On the next page, click the “Download Report” link to begin downloading the my_report.pdf file.
  3. Select a location on your local hard drive for saving the PDF document, and finally download the file to that location for some further processing.

Many web applications require an active session on the server before they allow any further actions, such as downloading user-specific content. Step 1 is included in the example above to emphasize that you may also need to manage session cookies in order to get a valid download link.

Selenium gives us all the tools for accomplishing steps 1 and 2, but the problem arises at step 3: Selenium doesn’t come with a built-in method for saving a file to disk (as opposed to opening it inside the browser). That applies to files referenced by an HTML <a> tag, such as <a href="my_report.pdf">Download Report</a>, and equally to any other file that’s available on a website, such as image files referenced by <img src="my_pic.png" /> tags.

Various workarounds have been devised for automated downloading of files from the web, such as using AutoIt or Sikuli in concert with Selenium. But each method has its own drawbacks.

The Solution

This tutorial shows how we can combine Selenium methods with the Apache HttpClient library to download files from a website – giving us a pure-Java solution to the problem. We’ll be using Selenium to get the download link URL, and then using the Apache library to send an HTTP GET request to that URL, to actually download the file.

The steps that we’ll implement:

  1. Get the URL of the file to be downloaded (using Selenium).
  2. Transform the Selenium set of cookie objects to its Apache HttpClient equivalent (CookieStore).
  3. Generate an HTTP client object and set its CookieStore.
  4. Generate an HTTP GET request with the URL of the file to be downloaded and execute the request using the HTTP client object.
  5. Capture the HTTP response and write its contents to a file on the local hard drive.

The Code

1. Get the URL of the file to be downloaded (using Selenium)

Assuming we’re planning to download my_report.pdf that’s referenced on the web page by:

<a href="my_report.pdf">Download Report</a>

The Selenium code to get the URL of the file to be downloaded would be:

driver = new FirefoxDriver();
// Selenium needs a fully qualified URL, including the scheme.
driver.get("http://www.example.com");
WebElement downloadLink = driver.findElement(By.linkText("Download Report"));
String fileUrl = downloadLink.getAttribute("href");
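
In most drivers, getAttribute("href") typically returns an already-resolved absolute URL. If you obtain the raw attribute value some other way (for example, by parsing the page source), a relative value such as my_report.pdf has to be resolved against the page URL first. The following is a minimal sketch using only the JDK; the class name DownloadUrls, its two helper methods, and the /reports/ path are hypothetical and not part of Selenium or of the original example.

```java
import java.net.URI;

class DownloadUrls {

    // Resolves a possibly relative href against the URL of the page it appeared on.
    // An href that is already absolute is returned unchanged.
    public static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    // Returns the last path segment of the URL, ignoring any query string.
    // This can serve as a default name for the saved file.
    public static String fileNameOf(String url) {
        String path = URI.create(url).getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // Hypothetical page URL; the article's example page has no path.
        String url = resolve("http://www.example.com/reports/index.html", "my_report.pdf");
        System.out.println(url);
        System.out.println(fileNameOf(url));
    }
}
```

The file name derived this way could be appended to a download directory instead of hard-coding the full output path.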

2. Transform the Selenium set of cookie objects to its Apache HttpClient equivalent (CookieStore)

If the download only succeeds within the session that Selenium has established so far (for example, when the earlier steps required username/password authentication), we need to extract the cookies held by Selenium’s WebDriver object and copy them into an Apache CookieStore object, which we’ll attach to the subsequent HTTP GET request. We’ll define the following method:

private CookieStore seleniumCookiesToCookieStore() {

	Set<Cookie> seleniumCookies = driver.manage().getCookies();
	CookieStore cookieStore = new BasicCookieStore();

	for(Cookie seleniumCookie : seleniumCookies){
		BasicClientCookie basicClientCookie =
			new BasicClientCookie(seleniumCookie.getName(), seleniumCookie.getValue());
		basicClientCookie.setDomain(seleniumCookie.getDomain());
		basicClientCookie.setExpiryDate(seleniumCookie.getExpiry());
		basicClientCookie.setPath(seleniumCookie.getPath());
		cookieStore.addCookie(basicClientCookie);
	}

	return cookieStore;
}
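
To make it clearer what the copied cookies amount to: when a request is sent, the HTTP client selects the stored cookies whose domain and path match the request URL and sends them as name=value pairs in a single Cookie request header. That is why the method above copies the domain and path along with the name and value. The sketch below builds such a header value with plain JDK classes; the class name CookieHeaderSketch, the method toCookieHeader, and the cookie names are hypothetical, and this is only an illustration of the header format, not HttpClient’s actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class CookieHeaderSketch {

    // Joins name/value pairs into the value of an HTTP "Cookie" request header,
    // e.g. "a=1; b=2". Domain/path matching is assumed to have happened already.
    public static String toCookieHeader(Map<String, String> cookies) {
        StringJoiner header = new StringJoiner("; ");
        for (Map.Entry<String, String> cookie : cookies.entrySet()) {
            header.add(cookie.getKey() + "=" + cookie.getValue());
        }
        return header.toString();
    }

    public static void main(String[] args) {
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("JSESSIONID", "abc123"); // hypothetical session cookie
        cookies.put("lang", "en");           // hypothetical preference cookie
        System.out.println("Cookie: " + toCookieHeader(cookies));
    }
}
```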

3. Generate an HTTP client object and set its CookieStore

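CookieStore cookieStore = seleniumCookiesToCookieStore();
// DefaultHttpClient is deprecated as of HttpClient 4.3; newer code would use
// HttpClients.custom().setDefaultCookieStore(cookieStore).build() instead.
DefaultHttpClient httpClient = new DefaultHttpClient();
httpClient.setCookieStore(cookieStore);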

4. Generate an HTTP GET request with the URL of the file to be downloaded, and execute the request using the HTTP client object

// downloadUrl is the file URL obtained from the link's href in step 1
HttpGet httpGet = new HttpGet(downloadUrl);
System.out.println("Downloading file from: " + downloadUrl);
HttpResponse response = httpClient.execute(httpGet);

5. Capture the HTTP response and write its contents to a file on the local hard drive

HttpEntity entity = response.getEntity();
if (entity != null) {
	File outputFile = new File(outputFilePath);
	// try-with-resources closes both streams even if the copy fails midway
	try (InputStream inputStream = entity.getContent();
			FileOutputStream fileOutputStream = new FileOutputStream(outputFile)) {
		int read;
		byte[] bytes = new byte[1024];
		while ((read = inputStream.read(bytes)) != -1) {
			fileOutputStream.write(bytes, 0, read);
		}
	}
	System.out.println("Downloaded " + outputFile.length() + " bytes. " + entity.getContentType());
}
else {
	System.out.println("Download failed!");
}
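
On Java 7 and later, the manual read/write loop can be replaced with java.nio.file.Files.copy, which streams the input into a file in one call. The following is a minimal sketch of that alternative, not the method used in this article; the class name StreamToFile and the method save are hypothetical, and a ByteArrayInputStream stands in for the stream you would get from entity.getContent().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class StreamToFile {

    // Copies the whole stream into target, replacing any existing file, and
    // closes the stream even if the copy fails. Returns the number of bytes written.
    public static long save(InputStream in, Path target) throws IOException {
        try (InputStream source = in) {
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("my_report", ".pdf");
        // Stand-in for entity.getContent(); a real run would pass the response stream.
        InputStream fake = new ByteArrayInputStream(
                "%PDF-1.4 demo".getBytes(StandardCharsets.US_ASCII));
        System.out.println("Saved " + save(fake, target) + " bytes to " + target);
    }
}
```

In the download method, the call would look like save(entity.getContent(), Paths.get(outputFilePath)), assuming the same variable names as in the snippet above.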

Putting it all together:

package download.example;

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.util.Set;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.CookieStore;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.cookie.BasicClientCookie;
import org.openqa.selenium.By;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class DownloadFileTest {

	static WebDriver driver;

	public static void main(String[] args) throws Exception {
		driver = new FirefoxDriver();
		driver.get("http://www.example.com");
		WebElement downloadLink = driver.findElement(By.linkText("Download Report"));
		String fileUrl = downloadLink.getAttribute("href");
		downloadFile(fileUrl, "C:\\temp\\my_report.pdf");
	}

	public static void downloadFile(String downloadUrl, String outputFilePath) throws Exception {

		CookieStore cookieStore = seleniumCookiesToCookieStore();
		DefaultHttpClient httpClient = new DefaultHttpClient();
		httpClient.setCookieStore(cookieStore);

		HttpGet httpGet = new HttpGet(downloadUrl);
		System.out.println("Downloading file from: " + downloadUrl);
		HttpResponse response = httpClient.execute(httpGet);

		HttpEntity entity = response.getEntity();
		if (entity != null) {
			File outputFile = new File(outputFilePath);
			// try-with-resources closes both streams even if the copy fails midway
			try (InputStream inputStream = entity.getContent();
					FileOutputStream fileOutputStream = new FileOutputStream(outputFile)) {
				int read;
				byte[] bytes = new byte[1024];
				while ((read = inputStream.read(bytes)) != -1) {
					fileOutputStream.write(bytes, 0, read);
				}
			}
			System.out.println("Downloaded " + outputFile.length() + " bytes. " + entity.getContentType());
		}
		else {
			System.out.println("Download failed!");
		}
	}

	private static CookieStore seleniumCookiesToCookieStore() {

		Set<Cookie> seleniumCookies = driver.manage().getCookies();
		CookieStore cookieStore = new BasicCookieStore();

		for(Cookie seleniumCookie : seleniumCookies){
			BasicClientCookie basicClientCookie =
					new BasicClientCookie(seleniumCookie.getName(), seleniumCookie.getValue());
			basicClientCookie.setDomain(seleniumCookie.getDomain());
			basicClientCookie.setExpiryDate(seleniumCookie.getExpiry());
			basicClientCookie.setPath(seleniumCookie.getPath());
			cookieStore.addCookie(basicClientCookie);
		}

		return cookieStore;
	}
}

View Rony Byalsky’s LinkedIn profile
