Important intro.
If your Excel version offers the Get Data > From File > From PDF experience (underlined in blue below):
you don’t need this article to combine several PDF files. Choose option From Folder and follow the wizard (Para. 1 in this article – amongst other – nicely pictures & explains the process)
Option From File > From PDF is missing
First thing you must do: Follow article Power Query / Get & Transform – Get data from PDF missing and if you get You CAN NOT import Tables from PDF files… you can close this article and look for alternatives to do what you expect
Setting expectations
Each scenario is quite unique and when importing data, several factors can lead to unexpected results, starting with the data source - PDF files in this case
Consequently, you cannot expect this article to be THE SOLUTION to each scenario. Most of the time, when Combining data from several files/sources some additional transformation steps are required, and they can't be covered here
This article and its solution workbook take you to the point where Pages or Tables from several PDF files are combined as a single Table
IMHO there’s a few things that must be understood before combining a series of PDFs otherwise risk is high that the unexpected will happen. Below I explain how to combine all Pages or Tables from a sample PDF file
Question or issue
If you have question(s)/need clarification about this article, feel free to reply. Otherwise, if you have issue with specific transformation steps:
- Browse this forum articles
- Search this forum and others for existing solutions
- Ask a new question on this forum
Combine Pages or Tables from one PDF
Download Sample_File1.pdf and store it an empty folder. For the next step it’s assumed the folder path is C:\Users\wxyz\EmptyFolder
Open a new Excel workbook > Go to Data (tab) > Get Data > From Other Sources > Blank Query and in the formula bar enter:
=Pdf.Tables(File.Contents("C:\Users\wxyz\EmptyFolder\Sample_File1.pdf"), [Implementation = "1.3"])
Power Query language is case sensitive. Enter the formula exactly as above
Pages or Tables?
When importing a PDF Power Query builds a Table (as show above) with all the Tables and Pages the PDF connector detected in the file. In Sample_File1.pdf 2 Tables and 3 Pages are detected – whatever the reason is
Question could be: What is the source of the issue (assuming there’s one)? The software/tool that generated the PDF or the PDF connector?
(my) Answer: That doesn’t matter. This is a fact, and I must deal with it if I want to complete the task - cf. Setting expectations, additional transformation steps will be required to get something that makes sense
The [Data] field/column consists of tables, each containing the rows imported by the connector for each Page and Table:
Clicking on one "cell" in [Data] we get a preview of the Table at the bottom of the screen (it can be resized as shown above). But scrolling each Table can be tedious, especially with large files. In addition to making a quick review of the tables, we can ask Power Query to count the number of rows in each table
NB: Previewing Page001 we see 9 columns. If you click on Page002 or Page003 in [Data] you’ll see 5 columns only - something that will need to be addressed with additional transformation steps if we decide to combine all Pages…
Count Rows & Columns in a Table
With the above table in front of us:
Go to Add Column (tab) > Custom Column > Give it a meaningful name (COUNT_ROWS below)
Under Custom column formula, type Table.RowCount( > Double-click on Data in Available columns > close the ) > OK
Power Query language is case sensitive. Enter the formula exactly as above
Add another Custom column COUNT_COLUMNS with Custom column formula: Table.ColumnCount([Data])
This helps to take the right decision (Pages vs Tables), isn’t it?
NB: if you click on Table002 in [Data] and scroll down in the preview you’ll see that the above row/record is missing
Filter Page or Table
In this scenario we don’t want to miss the last row/record so we will combine the [Kind] Page:
Click on the arrow down icon next to the [Kind] field > Uncheck Table > OK
Combine all [Data] tables
Before combining the tables (in [Data]) we must remove some columns now:
Go to Home (tab) > Choose Columns > Uncheck: Id, Kind, COUNT_ROWS and COUNT_COLUMNS > OK
Click on the expand button next to the [Data] column
Uncheck Use original column name as prefix (not a must but this makes life easier)
Important: Power Query says the "List may be incomplete" (I won’t explain why here) although the 9 columns (in the table for Page001) are listed. Now assume the max. [COLUMN_COUNT] was 15. If we don’t click on Load More to force Power Query to update the list, we will expand 9 columns only… So click Load More
Alternative
With very large files time to update the List can take time. If the order of the rows/records - in the result you expect - doesn’t matter you can Sort the [COUNT_COLUMNS] in Descending order, before removing the unnecessary columns
There’s also a way to build that List dynamically - not covered here
That’s it, the end of the Combining process. All remaining transformation steps are yours/mine to perform as they’re specific to each scenario
How To - Use the solution workbook
It’s quite easy (I hope):
Download the solution workbook and save it on your hard drive
Update the 3 required parameters in sheet QUERY Params:
Go to Data (tab) > Queries & Connections (the corresponding pane opens)
NB: Query Combined_PDFs_InFolder contains a single Table with all the Pages or Tables from the PDF files found in the folder you specified
Right-click on Combined_PDFs_InFolder > Reference (the Power Query Editor opens)
Perform the additional required – if any - transformation steps