Danny Guo | 郭亚东

Automating TurboTax Data Entry With Puppeteer

  ·  1,649 words  ·  ~8 minutes to read

I use Betterment for some of my investments. For my 2018 tax return, I wanted to use the free edition of Intuit’s TurboTax Online. For non-retirement account investments, each transaction must be documented in the tax return with Form 8949. I had hundreds of transactions in my account in 2018, and I didn’t want to enter them manually.

Betterment has a TurboTax integration to do this work automatically, but TurboTax doesn’t allow importing investment transactions unless you have the Premier plan or above. I also couldn’t import a TXF file because that feature is only available in the desktop versions of TurboTax. I didn’t want to pay for the Premier plan, in part because of Intuit’s lobbying efforts against a simpler tax filing system and because of their other questionable practices.

This seemed like a great opportunity to try out Puppeteer. It’s a Node.js library for controlling Chrome and is developed by the Chrome team.

In the end, I probably spent more time automating the process than it would have taken to just enter the transactions manually, but I learned a new tool, and I hope to reuse the script in the future. It’s available on GitHub with the name BrokerScribe. Here’s a demo of the end result (running at half speed):

Parsing the CSV file

First, I downloaded the 1099-B CSV file from Betterment’s tax forms page.

Betterment tax forms

After removing an extraneous comma at the end of the header row, I used xsv to look at the available columns.

$ xsv headers betterment-1099-b.csv
1   Account
2   Description
3   Symbol
4   CUSIP
5   Date Acquired
6   Date Sold
7   Gross Proceeds
8   Cost or Other Basis
9   Gain/(Loss)
10  Wash Sale Loss Disallowed
11  Federal Income Tax Withheld
12  Type of Gain(Loss)
13  Noncovered Securities
14  Reporting Category

Then I inspected the values to get an idea of what the data looks like.

$ xsv table betterment-1099-b.csv | xsv sample 3
Account                  Description                                                   Symbol  CUSIP      Date Acquired  Date Sold   Gross Proceeds  Cost or Other Basis  Gain/(Loss)  Wash Sale Loss Disallowed  Federal Income Tax Withheld  Type of Gain(Loss)  Noncovered Securities  Reporting Category
Danny's Taxable Account  0.004034 sh. Vanguard Total International Bond ETF Class O    BNDX    92203J407  07/09/2018     09/28/2018  $0.22           $0.22                $0.00        $0.00                      $0.00                        Short-term          No                     A
Danny's Taxable Account  0.697454 sh. Vanguard FTSE Developed Markets Class O          VEA     921943858  10/05/2015     08/20/2018  $29.66          $25.75               $3.91        $0.00                      $0.00                        Long-term           No                     D
Danny's Taxable Account  11.978683 sh. Vanguard FTSE Developed Markets Class O         VEA     921943858  11/16/2015     07/30/2018  $526.10         $443.93              $82.17       $0.00                      $0.00                        Long-term           No                     D

I used neat-csv to read in the data.

const fs = require('fs');
const csv = require('neat-csv');

(async () => {
    let transactions = fs.readFileSync('betterment-1099-b.csv');
    transactions = await csv(transactions);
})();

Using Puppeteer

I launched Puppeteer in non-headless mode so that the browser UI would actually appear and be controllable from outside of the script. It launches a bundled version of Chromium (which Chrome is built on). It’s possible to control an existing installation of Chrome, but each version of the library is only guaranteed to work with its bundled browser.

I also turned off the default viewport so that resizing the window would also resize the viewport.

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        defaultViewport: null,
        headless: false
    });
})();

Logging in

The first step in the TurboTax workflow was logging in.

TurboTax login page

const page = await browser.newPage();
await page.goto('https://myturbotax.intuit.com/');

I used prompts to interactively collect authentication info. I considered passing the info through environment variables, but I ran into two-factor authentication later, so I needed to be able to collect a dynamic code.

const {email, password} = await prompts([
    {
        type: 'text',
        name: 'email',
        message: 'Email address'
    },
    {
        type: 'text',
        name: 'password',
        message: 'Password',
        style: 'password',
    }
]);

I set the input values with JS after finding out how to pass a variable to evaluate.

await page.evaluate(email => {
    document.getElementById('ius-identifier').value = email;
    document.getElementById('ius-remember').checked = false;
}, email);

Next, I had to figure out how to handle waiting for the UI to be “ready” after hitting submit. In my experience, this part is a common source of flakiness for UI tests using tools like WebdriverIO and Nightwatch.js. I tried following the example in the documentation.

await Promise.all([
    page.waitForNavigation(),
    page.click('#ius-sign-in-submit-btn')
]);

Unfortunately, I kept getting this error: TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded. At first, I tried playing around with different waitForNavigation options, but then I realized that the site was built as a single-page application. While Puppeteer does treat History API usage as a navigation event, the site doesn’t use the History API. It merely updates the form on the page. I tried to wait for the existence of the password field instead using waitFor.

TurboTax password field

await Promise.all([
    page.waitFor(
        '#ius-sign-in-mfa-password-collection-current-password'
    ),
    page.click('#ius-sign-in-submit-btn')
]);

This solution worked. I filled in the password the same way that I filled in the email. The next part of the form was for two-factor authentication, using a code that was sent to my email. However, it only triggered some of the time, even after unchecking remember me on the first part of the log in form. This meant that I didn’t have a stable element to wait for. My original solution was to just put in a sleep and then check if the two-factor authentication form appeared.

await page.waitFor(5000); // in milliseconds

if ((await page.$('#ius-mfa-options-submit-btn')) !== null) {
    // collect the code, and submit it
}

However, I now realize that a much better solution would have been to use a selector with two elements, one for each possible path, with something like:

await page.waitFor('#ius-mfa-options-submit-btn, #dashboard');

This method avoids the antipattern of sleeping with a hardcoded duration, which slows things down and makes the code more brittle since the duration might not be long enough if the server is slow to respond, the client is on a poor internet connection, etc.

Entering transactions

After logging in, I had to enter transactions to fill this page.

empty Betterment sales page

I matched up each field on this form to the respective column in the CSV file.

empty transaction form

I continued to set the values directly with in-page JS.

await page.evaluate(transaction => {
    document.getElementById('edt_00').value = transaction.description;
    document.getElementById('edt_01').value = transaction.dateSold;
    document.getElementById('edt_02').value = transaction.dateAcquired;
    document.getElementById('edt_03').value = transaction.grossProceeds;
    document.getElementById('edt_04').value = transaction.costBasis;

    let category;
    if (transaction.reportingCategory === 'D') {
        category = '4';
    } else if (transaction.reportingCategory === 'A') {
        category = '1';
    } else {
        throw new Error('Unexpected reporting category');
    }

    document.getElementById('combo_00').value = category;
}, transaction);

But I ran into validation errors, even though I could see the values in the fields before trying to continue. I received an error message, and the fields were blanked out.

form validation errors

I suspected that the form has validation that runs as the user enters data. By setting the value directly, the validation was bypassed, and it thought the user had left the fields blank. After reading through this issue and this question, I found a better approach, which was to use the type and select methods instead of setting values with JS. Using these methods fixed the issue.

await page.type('#edt_00', transaction.description);
await page.select('#combo_00', category);

After submitting the transaction, there was a radio selection to choose if I wanted to submit another one. I tried clicking the “enter a new transaction” radio with Puppeteer’s click method.

await page.click('#radio_00\\:0');

But this time, the more ergonomic approach didn’t work. After hitting continue, the page thought I didn’t make a choice. I tried clicking the radio with JS, and I was able to continue.

await page.evaluate(() => {
    document.querySelector('#radio_00\\:0').click();
});

The final issue I ran into was that I would get Error: Node is either not visible or not an HTMLElement when trying to fill in info on the new transaction page. My guess for the root cause is that the site animates some DOM changes, like the appearance of the form, most likely with CSS transitions. Puppeteer’s waitFor only detects the element’s existence in the DOM. It doesn’t know if the element is visible, if click handlers have been attached, etc. I added a one second sleep as a workaround. A future improvement would be to find a way to wait for the end of any transitions.

In the end, the script took about 30 minutes to fill in all my transactions. I checked several of them to verify that the information was transcribed correctly.

filled Betterment sales page

Conclusion

I have three main takeaways from my first experiment with Puppeteer. The first is that I should default to using the high-level API (e.g. type, select, click) first before resorting to injecting JS with evaluate. The high-level API is cleaner and better simulates a real human being using the website.

The second is that I need to be smarter about waiting for things to be ready rather than adding sleeps. I’ve resorted to sleeping quite often in my past end-to-end test code, but it’s something I need to minimize going forward.

The third is that handling websites that use fancier techniques like animations or are built as SPAs can be tricky. I found multiple issues with people having problems interacting with elements, and a common explanation is that the site’s JS hasn’t fully run yet.

Regardless, I see a lot of potential in using Puppeteer, especially for automated tests. The author of PhantomJS stopped development after headless Chrome was announced. Another big tool in this field is Selenium, but it’s not very reliable in my experience, harder to set up, and resource heavy. I’m optimistic that Puppeteer will prove to be a better option for most cases.

One concern is that Puppeteer only works for Chrome, but the Puppeteer developers appear to be working on experimental support for Firefox. Edge is also switching to Chromium, and most fringe browsers like Opera and Brave run on Chromium. For non-testing purposes, such as taking screenshots and automating workflows like BrokerScribe does, browser compatibility doesn’t really matter, and Puppeteer is shaping up to be a great alternative to Selenium.


← How to Add Copy to Clipboard Buttons to Code Blocks in Hugo

Follow me on Twitter or subscribe to my free newsletter or RSS feed for future posts.

Found an error or typo? Feel free to open a pull request on GitHub.


comments powered by Disqus