Unstructured Data and the Data Graveyard


In just the past few weeks, I’ve spoken with several customers about their Data Graveyard: the place where data goes to die and is never seen again.

That’s right: they have systems in place that seem to do what they need, but they can’t get any data out of them. Here we are in the year 2016, and we still focus our entire implementations on making sure the systems collect every bit of data and spend no time on how to get that data out. We used to call LIMS the Data Graveyard, but with the proliferation of acronyms, we now find plenty of people with ELNs and other types of Data Graveyards, as well.

I also heard someone talking about “unstructured data,” and this is where I think we’re still having a problem. I will go so far as to say this: when you put unstructured data into your system, you are building a Data Graveyard.

You have also probably heard the phrase “garbage in, garbage out,” which means that you only get out of your system what you put into it.

An Issue
For those of you doing research who are reading this, some of you are possibly getting a bit steamed up by what I said and thinking that you need your data to be unstructured so that you can do your work without restrictions.

Let me correct you on this: what you need isn’t “unstructured” but “flexible,” and here’s one place we still have a problem. If you start tossing unstructured data into a system, you’re not going to be able to find it. If nothing distinguishes one piece of data from another, your searches have nothing to work with.

Think of it this way: at home, when you throw lots of things into your garbage can, suppose you realize you accidentally threw away something important, so important that you need to retrieve it. Possibly, you stick your arm in the can, aimlessly rooting around for what you need. Or, maybe you dump the can onto some newspaper to pick through everything. Regardless, the entire task is a nasty one where you’re pawing through everything to find what you need. That’s “unstructured.”

The bottom line is this: if you don’t find some way to help categorize data, possibly with some keyword, and if you make no effort at all to give some structure to finding that data, you won’t find it. No one I know goes back in to label things after the fact. And, like the garbage can, there will be a LOT in there to paw through, so finding anything won’t be easy.

So, please, force every implementer to make your process flexible but still give your data enough structure that you can find it when you need it.
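As a minimal sketch of what “flexible but still structured” could look like, here is a small Python example. Everything in it is an assumption made for illustration: the Entry record, its project and keywords fields, the find_by_keyword helper, and the sample project ID are all hypothetical and do not come from any particular LIMS or ELN. The free-text part stays completely open, while a keyword or two gives searches something to hold on to.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One hypothetical notebook entry: free text plus a little structure."""
    text: str                                          # the flexible, free-form part
    project: str = ""                                  # hypothetical categorizing field
    keywords: set[str] = field(default_factory=set)    # hypothetical tags


def find_by_keyword(entries: list[Entry], keyword: str) -> list[Entry]:
    """Return every entry tagged with the keyword, ignoring case."""
    wanted = keyword.lower()
    return [e for e in entries if wanted in {k.lower() for k in e.keywords}]


# Sample data invented for this sketch.
entries = [
    Entry("Ran the assay again, yield looked better.", "P-001", {"assay", "yield"}),
    Entry("Changed solvent; see attached spectra.", "P-001", {"solvent", "NMR"}),
    Entry("Misc notes from Tuesday."),  # no structure at all: the garbage-can case
]

hits = find_by_keyword(entries, "nmr")
```

The first two entries can be retrieved by keyword or project. The third can only be found by reading through everything, which is exactly the pawing-through-the-can problem.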

The Bottom Line
If the customers don’t force the implementers of software to do a better job of getting data back out, they just won’t do it. The software vendors who think that this isn’t their job are wrong. It’s their job to implement the system not just to manage the process but to allow data to be retrieved from it. It’s just that it’s hard and no one wants to do it, sometimes neither the customer nor the vendor.

And to the people who claim that they can’t do this because they don’t know what they’ll want in the future, I say that that’s not true. It’s hard but not impossible.

But here we are in 2016, still talking about the same issues we’ve talked about for years and years. Will there be a change? I don’t know the answer to that.

More Reading
If you’re interested in reading more regarding the Data Graveyard, here are some past posts and articles:
LIMS: The Data Graveyard
LIMS – The Data Graveyard II
Laboratory Informatics Silos and Data Graveyards

Gloria Metrick
GeoMetrick Enterprises
http://www.GeoMetrick.com/


3 responses on “Unstructured Data and the Data Graveyard”

  1. Unstructured data is not nearly as bad as it used to be. You can have unstructured data that has metadata that is auto-populated and associated with the unstructured data. Google is the master at retrieving data from the largest unstructured garbage can in the world… the internet. We need Google inside the LIMS and, in particular, the ELN.

    • Adding metadata means that there is at least some structure. In addition, while we might love our Google searches for most things, they’re not sufficient when we need to make sure we find everything, such as a search for data to protect a patent, to organize all the information about a product, and other similar searches.

      In Google, we’re usually so overwhelmed with results that we’re almost glad we’re not getting all of them. In science, that’s not the case.

      Also, while those searches get better all the time, “better” is still not sufficient.

  2. ACD/Labs provides solutions that database raw data files (unstructured or structured) in readable formats (not images or PDFs), meaning they can be searched using any of the criteria contained in the data file itself, not just a numerical final result or metadata. Currently, more than 145 data formats are supported out of the box. The system automatically reads new data files from the instrument directory and uploads them into a relational database that is interfaced with existing LIMS and/or ELN system records. Users then search the database using the data itself (peaks or signals of interest, structures, predicted signals, or a combination of all of these) instead of using queries. Results are reported in a hit table, with each record pointing to the associated LIMS or ELN record that fits the search criteria.

    Information can always be added to provide additional structure to the data but is not required. The ability of the system to accomplish this is thanks to partnerships ACD/Labs has with virtually all of the major instrument vendors found in analytical labs, a level of cooperation that no LIMS or ELN providers have reached because of competition and/or sheer reluctance to keep up with the various proprietary formats. With my past experience working for a LIMS vendor, I can say it is mostly the latter. Either way, this provides a great way of searching databases full of unstructured data, using the information users are most comfortable with: the data itself.
