Extract data from formatted text file

**ahartman** · 02-12-2010, 12:12 PM

Hi everyone. I think the situation that I have is doable, but I'm asking for guidance so I don't muddle through it w/ a pieced together solution. Also, this is a long post, so thank you in advance to those who take the time to read it all.

I have a text file containing data that I want to analyze. The text file actually has whitespace with headers and footers such that the data is spread across multiple "pages". That is, you can print the text file and each printed page has similar header/footer information, and then a list of data w/ each piece of data on a new line. Sort of like this:

Applesauce Name Date Official Stuff

---
data1
data2
data3

Bananas Page:1
Time: ##:##:##

Applesauce Name Date Official Stuff2

---
data4
data5
data6

Bananas Page:2
Time: ##:##:##

.
.
.
etc

I've got no problem importing the whole text file, but I want to know how to remove all the whitespace and header/footer stuff and just retain the pieces of data in between.

These are the things I know about all the text files in general:

-Each "page" has 66 total lines, including header, data, whitespace, and footer.

-Each "page" may have a minimum of 1 and maximum of 52 data lines.

-The text that always marks the beginning of the headers is: "Applesauce" (I've used a nonsense word to protect confidential data for this post).

-The text that usually marks the end of the header is: "---" and will be on line 11 if present.

-If the "page" does not include the text "---", data will start on line 7.

-If the "page" has the text "Cucumber", data will start on line 12.

-The text that always mark the beginning and end of the footers are: "Bananas" and "Time: ##:##:##" respectively, and happen on lines 65 and 66, respectively.

-I can know the first page's ##:##:## time values from the filename if I need to, but each subsequent "page" can get a slightly different time value depending on how long it took the source program to create the data file.

-The number of pages isn't known ahead of time, but is possible to get because the header info includes "Page: #" for each sequential page.

-The total number of lines of data isn't known in advance.

There's got to be enough there to program a macro to extract the data from the other stuff, but I'm not sure what the right approach is.

Some thoughts off the top of my head:

-After import, dividing the total number of imported rows by 66 will give the number of pages.

-Looking for "---" in each group of 66 rows will tell if that group's data starts on line 7.

-Looking for "Cucumber" in each group of 66 rows will tell if that group's data starts on line 12.

What's the right way to handle this task? Should I copy each block of 66 rows to a new sheet and then run the same filtering logic against each sheet? Or, should I just keep the imported lines on one sheet and process each block of 66 rows one after another?

Advise or suggestions would be greatly appreciated, as always!

**Richard Buttrey** · 02-12-2010, 12:53 PM

Hi,

This is a common requirement in all the corporate environments I've come across. The general technique I developed, run by a macro, is as follows.

After importing the text file add a new helper column A, (occasionally two or three helper columns may be called for).

Add a formula in A1 such that the formula is able to work out whether the row in question contains banner headlines, page headers, or column labels that you probably want to exclude. The formulas will be different for each text file of course but generally follow the sort of pattern:

Please Login or Register  to view this content.

the trick is to examine each relevant line you want to ignore and find some unique aspect of it that you can code in the IF(OR()) test. I generally create the formula first in Excel and play around with it, copying it down the rows until I'm happy that I can use it in the macro code.

Once you have the formula get the macro to add it to column A, filter the whole of the range for the word 'Delete' in column A and delete all these rows, finally removing the Autofilter.

In my previous company we had about 20 such text files that needed stripping of data so I created a general purpose macro which contained the 20 relevant formula and this could then be run by anyone just by selecting the appropriate report name from a front menu.

HTH

Extract data from formatted text file

LinkBack

Thread Tools

Rate This Thread

Display

Extract data from formatted text file

Re: Extract data from formatted text file

Thread Information

Users Browsing this Thread

Tags for this Thread

Bookmarks

Bookmarks

Posting Permissions