Power Query / Get & Transform – Combine PDF files in folder

Important intro.

If your Excel version offers the Get Data > From File > From PDF experience (underlined in blue below):

you don’t need this article to combine several PDF files. Choose option From Folder and follow the wizard (Para. 1 in this article – amongst other – nicely pictures & explains the process)

Option From File > From PDF is missing

First thing you must do: Follow article Power Query / Get & Transform – Get data from PDF missing and if you get You CAN NOT import Tables from PDF files… you can close this article and look for alternatives to do what you expect

Setting expectations

Each scenario is quite unique and when importing data, several factors can lead to unexpected results, starting with the data source - PDF files in this case

Consequently, you cannot expect this article to be THE SOLUTION to each scenario. Most of the time, when Combining data from several files/sources some additional transformation steps are required, and they can't be covered here

This article and its solution workbook take you to the point where Pages or Tables from several PDF files are combined as a single Table

IMHO there’s a few things that must be understood before combining a series of PDFs otherwise risk is high that the unexpected will happen. Below I explain how to combine all Pages or Tables from a sample PDF file

Question or issue

If you have question(s)/need clarification about this article, feel free to reply. Otherwise, if you have issue with specific transformation steps:

- Browse this forum articles

- Search this forum and others for existing solutions

- Ask a new question on this forum

Combine Pages or Tables from one PDF

Download Sample_File1.pdf and store it an empty folder. For the next step it’s assumed the folder path is C:\Users\wxyz\EmptyFolder

Open a new Excel workbook > Go to Data (tab) > Get Data > From Other Sources > Blank Query and in the formula bar enter:

=Pdf.Tables(File.Contents("C:\Users\wxyz\EmptyFolder\Sample_File1.pdf"), [Implementation = "1.3"])

Power Query language is case sensitive. Enter the formula exactly as above

Pages or Tables?

When importing a PDF Power Query builds a Table (as show above) with all the Tables and Pages the PDF connector detected in the file. In Sample_File1.pdf 2 Tables and 3 Pages are detected – whatever the reason is

Question could be: What is the source of the issue (assuming there’s one)? The software/tool that generated the PDF or the PDF connector?

(my) Answer: That doesn’t matter. This is a fact, and I must deal with it if I want to complete the task - cf. Setting expectations, additional transformation steps will be required to get something that makes sense

The [Data] field/column consists of tables, each containing the rows imported by the connector for each Page and Table:

Clicking on one "cell" in [Data] we get a preview of the Table at the bottom of the screen (it can be resized as shown above). But scrolling each Table can be tedious, especially with large files. In addition to making a quick review of the tables, we can ask Power Query to count the number of rows in each table

NB: Previewing Page001 we see 9 columns. If you click on Page002 or Page003 in [Data] you’ll see 5 columns only - something that will need to be addressed with additional transformation steps if we decide to combine all Pages…

Count Rows & Columns in a Table

With the above table in front of us:

Go to Add Column (tab) > Custom Column > Give it a meaningful name (COUNT_ROWS below)

Under Custom column formula, type Table.RowCount( > Double-click on Data in Available columns > close the ) > OK

Power Query language is case sensitive. Enter the formula exactly as above

Add another Custom column COUNT_COLUMNS with Custom column formula: Table.ColumnCount([Data])

This helps to take the right decision (Pages vs Tables), isn’t it?

NB: if you click on Table002 in [Data] and scroll down in the preview you’ll see that the above row/record is missing

Filter Page or Table

In this scenario we don’t want to miss the last row/record so we will combine the [Kind] Page:

Click on the arrow down icon next to the [Kind] field > Uncheck Table > OK

Combine all [Data] tables

Before combining the tables (in [Data]) we must remove some columns now:

Go to Home (tab) > Choose Columns > Uncheck: Id, Kind, COUNT_ROWS and COUNT_COLUMNS > OK

Click on the expand button next to the [Data] column

Uncheck Use original column name as prefix (not a must but this makes life easier)

Important: Power Query says the "List may be incomplete" (I won’t explain why here) although the 9 columns (in the table for Page001) are listed. Now assume the max. [COLUMN_COUNT] was 15. If we don’t click on Load More to force Power Query to update the list, we will expand 9 columns only… So click Load More

Alternative

With very large files time to update the List can take time. If the order of the rows/records - in the result you expect - doesn’t matter you can Sort the [COUNT_COLUMNS] in Descending order, before removing the unnecessary columns

There’s also a way to build that List dynamically - not covered here

That’s it, the end of the Combining process. All remaining transformation steps are yours/mine to perform as they’re specific to each scenario

How To - Use the solution workbook

It’s quite easy (I hope):

Download the solution workbook and save it on your hard drive

Open it in Excel

Update the 3 required parameters in sheet QUERY Params:

Go to Data (tab) > Queries & Connections (the corresponding pane opens)

NB: Query Combined_PDFs_InFolder contains a single Table with all the Pages or Tables from the PDF files found in the folder you specified

Right-click on Combined_PDFs_InFolder > Reference (the Power Query Editor opens)

Perform the additional required – if any - transformation steps

Close & Load…

Was this article helpful?

Sorry this didn't help.

Great! Thanks for your feedback.

How satisfied are you with this article?

Thanks for your feedback, it helps us improve the site.

How satisfied are you with this article?

Thanks for your feedback.

 

Forum Article Info


Last updated December 1, 2022 Views 3,003 Applies to: