Delete Records in a set after classification

**DannyGIS** · 05-23-2014, 07:46 PM

Hello,

I have a large (300K+ records) database with a sizeable amount of duplicate records. I want to delete the duplicates but this is not a matter of simply Remove Duplicates; I need to evaluate them before I do.

I am wondering what functions would:

1) select the specific record in a set of duplicates that makes a determination of a status
2) once the status has been determined for the set, delete all other records

Fields in my database:

ACCIDENT NUM (ID field, in text or General format)
DUP (for Duplicate, indicated by a character, for now its a "?")
OCC_KILLED (in Number format)
OCC_INJURED (in Number format)
SEVERITY (in text format)

Here are some scenarios:

ACC dup K I
12345 ? 0 0
12345 ? 1 2

Or:

ACC dup K I
123456 ? 0 1
123456 ? 1 0

Or:

ACC dup K I
1234567 ? 0 0
1234567 ? 0 2
1234567 ? 0 0

This is the formula for indicating if there are Duplicate records in the larger dataset:

=IF(OR(A2=A3,A2=A1),"?","")

I need to determine the Severity of the accident based on this:

If OCC_KILLED > 0 then SEVERITY = F (for Fatal)
IF OCC_INJURED > 0 and > OCC_KILLED then SEVERITY = I (for Injury)
IF OCC_KILLED and OCC_INJURED >= 0 then SEVERITY = F
IF OCC_KILLED and OCC_INJURED = 0 then SEVERITY = PDO (for Property Damage Only)

I have a code already in place for how to create the value for Severity but it DOES NOT account for duplicate records:

(in SEVERITY field):

=IF(A1<>0, "F",IF(B1<>0,"I","PDO")

sampleset.xls is a sample ot the database.

Any and all help will definitely be appreciated.

Dan B

**hamjam** · 05-27-2014, 09:47 AM

Hello Dan
It seems that the accident number is the unique identifier. The sample sheet is missing the severity column but you are saying that you have that sorted.
The item that is not clear is which record from the duplicate records you want to keep, i.e.
if the accident contained a fatality record, is that the record to keep and discard the rest OR
do you want to sum the numeric values of occ_killed and occ_injured and create this record and discard the rest

So is there a ranking when you have calculated the severity, i.e. from highest to lowest : fatality : injury : PDO : null

And is the spreadsheet sorted, the sample sheet looks more like a pivot table

Regards

**DannyGIS** · 05-27-2014, 01:49 PM

hamjam,

Thanks big time for getting back so fast.

You can put in a Severity field on your own when you start fooling around--make it a Text/String field so that the end result will be "K" (killed) or "I" (injured) or "PDO" (PDO).

The comment you make is EXACTLY what I need help with:

HOW DO I identify which record determines the Severity, and which records to delete based on the determination. I have thought of doing totals in the K and I fields but my project supervisor doesn't want to do it like that. Once there is a record that = Severity, then delete the rest, but--especially with more than one record--how to make sure the correct record is used and every other one is deleted. I have attached an example that ties into this; the formula scans the whole dataset but for some reason sometimes it worked and sometimes it didn't?

I was also wondering if its possible to create a Comma-Delimited file (maybe make this an index?) for all the sets and then create an Index for all the types of scenarios?

As in:

0,0,0,0,0,0 (Type a--3 sets of 0 killed, 0 injured)
0,1,0,1 (Type b -- 2 sets of 0 killed, 1 injured)
0,2,0,1 (Type c--same Accident Number, one record of 0 killed, 2 injured, one record of 0 killed, 1 injured)

and so on, with Totals of how many Type a, Type b, Type c...

Is it also possible to do all of these procedures at the same time?

1) determine severity
2) determine which records to delete (btw, this will either be done manually or with a VBA script)
3) create groups and numbers of the type of them

I didn't think of the table being a pivot table. Is this bad or good?

**hamjam** · 05-27-2014, 02:38 PM

Hello again
The issue does not seem to be with determining a severity classification - as you have specific rules to determine that.
The issue seems to be uncertainty around how to rank the severity classification from highest to lowest. If that is not available, you need to make a rule and then test it against the "project supervisor" or client. My suggestion was :
level 1 - fatality
level 2 - injury
level 3 - PDO
level 4 - null

So the implication is that you could take the record at the highest level and discard other duplicate records that are at an equal or lower level.
I was just not certain if you should consolidate the values, i.e. if there are 2 fatalities and 3 injuries over 5 records, should your final record be that single view or the consolidated view. Or is the important result the accident ID and severity classification and the rest of info is not really of value

Your previous formula addressed point 1
I can throw some code together for point 2, you just need to finalise the rule that will be applied by the code
I suggest a pivot table for point 3, or else simple formula such as =COUNTIF()

Regards

**DannyGIS** · 05-27-2014, 03:11 PM

Hamjam:

1) Yes, we do need to determine a Severity classification--the database consists of Traffic Collision values. There are no "null" end results from the formula--for our effort the only results are your levels 1-3 (PDO stands for Property Damage Only). I'm sorry, I probably should have attached a spreadsheet that indicates a formula for determining this, but there are some screwy results, as I pointed out in the jpg.

2) Yes, you are correct: Accident ID and Severity classification are one part of what is needed, but once this has been determined by the correct record, then indicate which ones are to be deleted (how? another field?)

3) Also, and then this should probably go first in this sequence, is to count the Type of scenario first, before determining Severity and duplicate records. My project supervisor is requesting this, and I sense COUNTIF might be a good way to go but I don't have enough experience with it--same with Pivot Tables, though either one might work.

THe attached table might help, but again, notice on the last set of 4 records how it should be an I and not a PDO (given that there is 1 Injury indicated).

Thanks again for all your help.

Dan

**DannyGIS** · 05-27-2014, 06:33 PM

Hamjam:

This is a simpler way of looking at the Grouping activity, but what I am (scared) of is the exponential numbers that may result (I am ultimately going to be dealing with a 300,000 record database:

ACCIDENT NUM K I GROUP

1234 0 1
1234 0 1 1

12345 0 0
12345 0 0
12345 0 0 2

123456 0 2
123456 0 1 3

1234567 0 1
1234567 0 1 1

What I am thinking is how do I do this:

1) Scan whole database and group each set of duplicates (based on ACCIDENT NUMBER)
2) Compare each set of duplicates

Group 1st set into Group A, Set Count to 1
If 2nd set SAME AS Group A then Add to Count
If 2nd set DIFFERENT to Group A then New Group B and Set Count = 1

And compare groups as we go.

Repeat same set of logic (comparing sets of duplicates) throughout whole database--either add to pre existing sets and add to count or create new set and create new count

It almost feels like a VBA macro, but I want to think I'm wrong too.

I hope this isn't quantum physics either.

Dan B

**hamjam** · 05-28-2014, 05:02 PM

GroupErrors2.xls
Hello there
I am not sure why you are concerned about the number of rows unless you are using Excel 2003 or earlier.

I would suggest that you sort the data in the database before the extract to Excel. Typically this could be done through SQL - ORDER BY ACCIDENT_NUM

I have written a macro attached to a button on the worksheet. The rule for the severity is contained in a separate function that returns the highest severity to be 3, then 2 and then 1. This is called from the routine to determine the score.
The code evaluates the current row and the row below it, to determine firstly if they are duplicate. If so, the severity is calculated and the lower severity row is deleted.

Have a look and see if that works for you. You can hit ALT and F11 to bring up the VBA screen and have a look at the code. I hope it is readable.

Regards

**DannyGIS** · 05-28-2014, 06:19 PM

Hamjam,

I am NOT concerned about the number of rows, rather, I am interested in identifying the patterns or Groups of duplicate records based on the duplicate ACCIDENT NUMBER.

As I illustrated above, each set of duplicate records and their pattern counts as one group. Within the database of sets of duplicate records, certain patterns of duplicate values (see above) repeat themselves. These patterns may consist of 3 sets (Killed/Injured) of the same numbers (0 and 0), or 2 sets of different numbers (0,1, 0, 2) but for the same Accident Number, or 3 sets of the same number not equalled to 0 (1 and 1). I need to identify Patterns and count how many times they happen.

My biggest fear is that given the size of the database and the THOUSANDS of duplicates, the number of Patterns is somewhat exponential. I remember reading how one Logical Function is limited to 127 scenarios, so I'd like to think that my database won't go that far.

I do thank you for the script and all of your efforts. I admit the counting of different patterns feels like DNA research, but please read my messages again--how unclear am I? Really?

Dan B

**DannyGIS** · 05-28-2014, 06:43 PM

Hamjam,

Thanks kindly for your Macro on the spreadsheet, but:

1) In each set of duplicate records based on Accident Number, how did the Macro know which records to delete if they met the criteria I indicated earlier?

2) Before Severity is determined, is there a way of identifying each pattern of the set of duplicate Accident Number's Killed/Injured values, and creating a label for each group? See my illustration on the first page of the Spreadsheet.

Dan B

**hamjam** · 05-29-2014, 04:08 AM

Hello
It seems that I missed the problem. I was working on a solution to remove duplicate rows, based on keeping the highest ranked severity. Then the remaining spreadsheet could be analysed further.

However, it seems that you are rather after categorising each row into a pattern. So the removal of duplicate rows does not follow a simple rule but rather if it is duplicate of an existing pattern. Also, you treat patterns as unique, not based on severity but on the value of the 2 variables. In this case, yes the more data, the greater potential number of patterns.

On way would be to pre-determine the number of patterns of the population of data, i.e. the total set. This could be done by =MAX(your range) for each column representing the variables. So say you have fatal column = 25 and injury column = 35 then the maximum possible patterns would be 26 x 36 = 936. The actual number of patterns would likely be less, as you are not going to have all the combinations present.
I would suggest to create a 2 dimensional array consisting of 3 variables, a pattern number, the fatal value and injury value. So you could populate the array given the MAX function above. Once done, you could examine each row and obtain the pattern number for the combination of the 2 variables from the row and add this pattern number to a new column.

After the completion, you could do a pivot table showing the count of each pattern number.

Good luck

**DannyGIS** · 05-29-2014, 02:13 PM

Hamjam,

For the Nth time; PLEASE: The only time I'm CALCULATING VALUES is the number of times a Pattern happens (The Count).

But you are at least hinting at something:

How would I create a Pattern Number? A Pattern Number is more exactly what I am trying to do. Again, for the Nth Time, and remember, all values below are in STRING format:

ACCIDENT NUM KILLED INJURED PATTERN COUNT

12345 0 0
12345 0 0
12345 0 0 1 1

123456 0 0
123456 0 0 2 1

12345 0 0
12345 0 0 2 2

123456 0 0
123456 0 0
123456 0 0 1 2

123444 1 1
123444 1 1 3 1

12345 0 0
12345 0 0 2 3

How would I create a Pattern Number (PATTERN FIELD)?

In other words, Pattern 1 = 0,0 (3x), Pattern 2 = 0,0 (2x), Pattern 3 = 1,1 (2x), and so on...and Count = SUM of Number of times for Pattern 1, Pattern 2, Pattern 3....

How could I use GROUP to Group the values in KILLED and INJURED and have it equal Pattern 1, Pattern 2, etc...?

(Count is something I'll need but its probably simple enough, I would think, Yes? TOTAL of Patterns in Pattern Field, like =SUM("1",B2:B5, SUM("2",B2:B5), etc.

Maybe what you are suggesting with the arrays and what not might be useful but I'm not seeing it.

Let's first tackle how to group values together then Count the groups as they go.

Dan

Delete Records in a set after classification

LinkBack

Thread Tools

Rate This Thread

Display

Delete Records in a set after classification

Re: Delete Records in a set after classification

Re: Delete Records in a set after classification

Re: Delete Records in a set after classification

Re: Delete Records in a set after classification

Re: Delete Records in a set after classification

Re: Delete Records in a set after classification

Re: Delete Records in a set after classification

Re: Delete Records in a set after classification

Re: Delete Records in a set after classification

Re: Delete Records in a set after classification

Thread Information

Users Browsing this Thread

Similar Threads

[SOLVED] need to be able to update records with button click and delete records sample included

[SOLVED] How to delete automatically records?

Delete these records

delete records with unique value

[SOLVED] Delete unique records

Tags for this Thread

Bookmarks

Bookmarks

Posting Permissions