Break out paragraphs in PDF via Power Query in excel

  1. #1
    Forum Contributor
    Join Date
    02-13-2018
    Location
    USA
    MS-Off Ver
    Office 365
    Posts
    208

    Break out paragraphs in PDF via Power Query in excel

    Hello,

    Attached is a workbook that uses Power Query to import a sample PDF contract. I am trying to find a way (either in Power Query or with Excel formulas) to break out all the sections as they appear in the PDF, that is, in paragraph form. In column R on the SampleContract-Shuttle pdf sheet I have a concatenate formula, but it is only a join, which works for sentences. I am looking for a formula or a Power Query approach that splits the PDF into paragraphs just as they appear in the actual contract. Both files are attached. Does anyone have an idea? Thanks.
    Attached Files

  2. #2
    Forum Expert
    Join Date
    08-17-2007
    Location
    Poland
    Posts
    2,545

    Re: Break out paragraphs in PDF via Power Query in excel

    I don't think the task is doable with the current layout of the data received. Take a look at rows 9 and 32: duplicates have been created, although the text does not occur multiple times in the PDF document. There are more such duplicates in the received data.
    This is a difficult task. I will admit that I have never come across a converter that fully reproduces the source document.
    At the moment I am working on reading PDF documents using VBA. Admittedly, I'm at the beginning of this road, but there is some hope in being able to retrieve the bounding box of each word (its so-called envelope), i.e. the position on the page of the rectangle in which the word is written. Knowing the positions of the words, you can combine them into sentences and paragraphs. Unfortunately, you need Adobe Acrobat Pro to read the bounding boxes. The attachment shows the result of such a reading of your PDF document. After reading, the data were sorted by page and then by the position of the top edge of each rectangle. Page numbers are counted from 0, not 1. Also note that positions on the page are measured from the bottom-left corner of the page, not from the top-left corner as in Excel. The "¶" sign marks a move to a new line. As you will see when examining the attachment, words are not always inserted in logical order (the order we see in the document). The order of insertion is reflected in the Idx column.
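
To illustrate how word positions can be combined into paragraphs, here is a minimal Python sketch. It is not the VBA reader described above. The `Word` fields, the function name `group_paragraphs`, and both thresholds are assumptions made for illustration, and the sample coordinates are invented. Words whose bottom edges nearly coincide are joined into a line, and a new paragraph starts when the vertical gap to the previous line is noticeably larger than that line's height.

```python
from dataclasses import dataclass


@dataclass
class Word:
    page: int      # zero-based page index
    left: float    # coordinates measured from the bottom-left page corner
    bottom: float
    right: float
    top: float
    text: str


def group_paragraphs(words, line_tol=2.0, para_gap=1.5):
    """Join words into lines, then lines into paragraphs.

    line_tol: largest difference in `bottom` for words to share a line.
    para_gap: a new paragraph starts when the distance between two
              consecutive line bottoms exceeds para_gap * previous line height.
    Both thresholds are guesses and need tuning per document.
    """
    # Reading order: page ascending, bottom descending, left ascending.
    ordered = sorted(words, key=lambda w: (w.page, -w.bottom, w.left))

    lines = []
    for w in ordered:
        last = lines[-1] if lines else None
        if last and last["page"] == w.page and abs(last["bottom"] - w.bottom) <= line_tol:
            last["words"].append(w)
        else:
            lines.append({"page": w.page, "bottom": w.bottom,
                          "height": w.top - w.bottom, "words": [w]})

    paragraphs = []
    prev = None
    for line in lines:
        # Small baseline differences can disturb the left-to-right order.
        line["words"].sort(key=lambda w: w.left)
        text = " ".join(w.text for w in line["words"])
        same_page = prev is not None and prev["page"] == line["page"]
        if same_page and prev["bottom"] - line["bottom"] <= para_gap * prev["height"]:
            paragraphs[-1] += " " + text
        else:
            paragraphs.append(text)  # first line, new page, or large gap
        prev = line
    return paragraphs


sample = [
    Word(0, 85, 700.5, 130, 710.5, "Agreement"),
    Word(0, 50, 700, 80, 710, "This"),
    Word(0, 50, 688, 70, 698, "is"),
    Word(0, 75, 688, 110, 698, "made"),
    Word(0, 50, 660, 100, 670, "Section"),
    Word(0, 105, 660, 115, 670, "2"),
]
print(group_paragraphs(sample))
```

In this sketch a page break always starts a new paragraph, so a paragraph that continues onto the next page would be split and would need to be merged by hand or with an extra rule.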

    Artik
    Attached Files

  3. #3
    Forum Expert
    Join Date
    08-17-2007
    Location
    Poland
    Posts
    2,545

    Re: Break out paragraphs in PDF via Power Query in excel

    And it worked out quite well.
    I changed the sort order. It is now Page, Bottom, Left.
    As I mentioned, I haven't found the perfect converter so far, and this one isn't perfect either: rows 72 and 73 should be in one paragraph.
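
The Page, Bottom, Left sort can be sketched in Python as follows. The rows and the helper name `reading_order` are invented for illustration. The sketch also assumes that Bottom is sorted in descending order, because positions are measured from the bottom-left corner and a larger Bottom value is therefore higher on the page; the post does not state the direction.

```python
# Each row: (idx, page, bottom, left, text). Values are made up.
rows = [
    (0, 0, 700.0, 50.0, "First"),
    (1, 0, 650.0, 50.0, "Third"),
    (2, 0, 700.0, 90.0, "Second"),
    (3, 1, 720.0, 50.0, "Fourth"),
]


def reading_order(rows):
    # Page ascending, Bottom descending (assumed), Left ascending.
    return sorted(rows, key=lambda r: (r[1], -r[2], r[3]))


print([r[4] for r in reading_order(rows)])
# prints ['First', 'Second', 'Third', 'Fourth']
```

Note that the insertion order (the first field, standing in for Idx) differs from the reading order here, which is the situation described for the real data.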

    Artik
    Attached Files

  4. #4
    Forum Contributor
    Join Date
    02-13-2018
    Location
    USA
    MS-Off Ver
    Office 365
    Posts
    208

    Re: Break out paragraphs in PDF via Power Query in excel

    I have decided to use MS Power Automate flows instead of Power Query for the PDF data extraction, so I am marking this thread as solved.
