Data extraction

Extract data contained in a PDF

Sometimes data are locked inside a PDF, which makes it a pain in the ass to extract them into a CSV file. Luckily, some tools exist to ease this process.

Plain text

Tables

  • Tabula: you have to select each table “by hand”, but so far it is the most accurate tool I’ve used. Perfectly suited if you only have a small number of tables to extract and/or a lot of time 🙂
  • tabulizer: this R package lets you script the whole extraction from R (it wraps the Tabula Java library), but the extractions I got were really messy
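
Whichever tool you pick, the raw result usually needs a bit of tidying before it becomes a usable CSV. Here is a minimal sketch in Python using only the standard library. It assumes the extraction step hands you each table as a list of rows, with every cell as a string. The helper names `clean_rows` and `rows_to_csv` and the sample data are made up for illustration; they are not part of Tabula or tabulizer.

```python
import csv
import io


def clean_rows(rows):
    """Normalise whitespace in every cell and drop rows that are entirely empty."""
    cleaned = []
    for row in rows:
        # split() + join collapses runs of spaces and stray line breaks inside a cell
        cells = [" ".join(cell.split()) for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def rows_to_csv(rows):
    """Return the cleaned rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(clean_rows(rows))
    return buffer.getvalue()


# Hypothetical messy output, as an extraction tool might return it
raw = [
    ["  Country ", "Year", " Value"],
    ["", "", ""],
    ["France", " 2015 ", "1234"],
    ["New\nZealand", "2015", "567"],
]

csv_text = rows_to_csv(raw)
print(csv_text)
```

To write a file instead of a string, open it with `open("table.csv", "w", newline="")` and pass the file object to `csv.writer`.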
