This is a spin-off of the large spreadsheet thread - I don't want to derail it too much.
One of my issues with one of my customers is that they often demand too much data. We have a report that tracks individual models (of which there are hundreds) and individual failure codes (also hundreds - which is way too many, IMHO). So that's tens of thousands of data points every day. What's more, they want to see every day for a month and be able to "expand" the pivot table on any month for the past two years. So there are about 3.5 million pieces of data tracked on this one report. That's one huge haystack in which to look for needles.
The data are also highly variable. We might process 100 units of a particular model one day, 300 the next, 50 the day after, and then none the day after that. Weekly averages smooth this out by letting the counts regress toward the norm. Which brings up the question: how much data is enough? How much is too much? How granular do you need to be?
We do try to use run charts with upper and lower control limits, but those assume a normal distribution, which we don't have. We have growth (more people are buying the product) and seasonality - I can compensate for both of these - but the stuff coming down the pipe is "clumpy": stores wait until the box gets filled and then ship it to our shop. The inspection lines also change constantly - the inspectors set up to test a particular model, run it all day or for several days, then move on to another model until there's enough backlog to switch back to the first.
Models also have a life cycle: they are introduced, and everyone is happy for a couple of months. More units get bought, and then they start coming back: first a trickle, then a surge, then a tapering off to the point where the real old-timers come straggling in at one or two a month. How far back do I go under these circumstances to compute the control limits? Too far back and they are static and do not reflect the current situation. Not far enough back and they are ephemeral.
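One way to handle the lookback question is to compute the limits from a rolling window, so old data ages out automatically. This is just a sketch of the idea, not what our report actually does - the window length and 3-sigma width here are arbitrary illustrative choices:

```python
from statistics import mean, stdev

def rolling_limits(counts, window=8, n_sigma=3):
    """For each point, derive control limits from only the preceding
    `window` observations, so stale history stops influencing them.
    Returns None where there is not yet enough history, else True/False
    for "outside the limits"."""
    flags = []
    for i, x in enumerate(counts):
        history = counts[max(0, i - window):i]
        if len(history) < 2:
            flags.append(None)  # can't compute a stdev from < 2 points
            continue
        m, s = mean(history), stdev(history)
        lcl, ucl = m - n_sigma * s, m + n_sigma * s
        flags.append(x < lcl or x > ucl)
    return flags

# Steady weeks, then a surge: only the surge gets flagged.
weekly = [100, 110, 95, 105, 98, 102, 300]
print(rolling_limits(weekly))
# → [None, None, False, False, False, False, True]
```

A shorter window tracks the life cycle more closely but makes the limits jumpier; that trade-off is exactly the static-versus-ephemeral tension above.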
In other words, it's an out-of-control system. Nonetheless, the run charts do occasionally show something drastic - and there is usually a good reason for it.
One of the metrics our user is fond of is the period-over-period change rate, such as (This Month - Last Month) / Last Month. So if you had 1 last month and have 2 this month, that's a 100% increase, whereas if you had 100,000 last month and 150,000 this month, that's only a 50% increase. To compensate for this I have a sliding "sigma" on my charts: low-volume models get wide control margins, high-volume models get narrower ones.
I push hard for a Pareto analysis. Don't worry about the small stuff - concentrate on volume. Still, some people want the details on everything.
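For anyone unfamiliar with the technique, a Pareto cut just sorts the failure codes by volume and keeps the few that account for most of it. The codes and counts below are made up for the demo:

```python
def pareto(counts_by_code, cutoff=0.8):
    """Return the failure codes that together account for at least
    `cutoff` of total volume, highest-volume codes first."""
    total = sum(counts_by_code.values())
    vital, running = [], 0
    for code, n in sorted(counts_by_code.items(), key=lambda kv: -kv[1]):
        vital.append(code)
        running += n
        if running / total >= cutoff:
            break
    return vital

failures = {"F01": 500, "F02": 300, "F03": 120, "F04": 50, "F05": 30}
print(pareto(failures))  # → ['F01', 'F02']  (80% of total volume)
```

Two codes out of five carry 80% of the volume here - which is the argument for not chasing the other hundreds of codes individually.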