Exploring strategies for cleaning messy data
Thanks to the open government community’s efforts over the past few decades,
a large, hard-won collection of government datasets has been gathered and
made publicly available. It’s inspiring to
look up from the day-to-day grind of opening up government data to see how much
progress has already been made. Now, though, we must bear the burden of our
collective success, and recognize that we’ve created an unruly menagerie of data
sources with many related, but unrelatable, datasets.
At Sunlight, this means that
we’re consolidating many of the related but separate projects that have sprung up over the years. We’re applying all that we’ve learned from
the dozens of projects we’ve done to
provide a unified experience. The public should not have to search a dozen
different databases in order to find the information they seek. Just as no
man is an island, information cannot have meaning outside the context of its
collection and environment. We aim to provide fast, easy and meaningful context
to government affairs.
Over the past year, we’ve been working on taming these messy data by testing and
validating new ways of moving and
representing data. As we’ve been figuring out how
to effectively consolidate our data, we’ve found ourselves facing the same problem
time and time again. It’s a basic issue that runs deep, seemingly without any
easy fix: The datasets we collect don’t have reliable identifiers associated
with each person or organization mentioned in the data. There is nothing
equivalent to a social security number that allows data collectors to reference
the same entity across datasets (or even consistently within
the same dataset).
Bootstrapping authority
We must act as curators, creating reliable identifiers ourselves, making
decisions about which identifiers each piece should get and managing those
identifiers in the face of changes to the content and format of incoming
data. We’re forced to move beyond finding, liberating and publishing data. We
must use all the data we have to provide context for every piece of data we
have. There is no authority on the data as a whole, so we’re
forced to rely on ourselves and start up the process from scratch.
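To make the curation problem concrete, here is a minimal Python sketch of minting identifiers for names that arrive in inconsistent spellings. It is an illustration only, not a description of any system Sunlight runs: the `normalize_name` helper, the `EntityRegistry` class, the suffix list and the `entity-000001` identifier format are all hypothetical choices made for this example.

```python
import re

# Assumption: a tiny, hand-picked list of tokens to ignore. A real
# curation effort would need a far more careful list.
IGNORED_TOKENS = {"inc", "llc", "corp", "co", "the"}


def normalize_name(raw: str) -> str:
    """Reduce a raw name to a comparison key.

    Lowercases the name, replaces punctuation with spaces, and drops
    a few common corporate suffixes so simple variants compare equal.
    """
    name = raw.lower()
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    tokens = [t for t in name.split() if t not in IGNORED_TOKENS]
    return " ".join(tokens)


class EntityRegistry:
    """Hands out one identifier per normalized name (hypothetical design)."""

    def __init__(self) -> None:
        self._ids: dict[str, str] = {}

    def identifier_for(self, raw_name: str) -> str:
        key = normalize_name(raw_name)
        if key not in self._ids:
            # Mint a new identifier the first time a key is seen.
            self._ids[key] = f"entity-{len(self._ids) + 1:06d}"
        return self._ids[key]


registry = EntityRegistry()
first = registry.identifier_for("Acme Corp.")
second = registry.identifier_for("ACME, Corp")
other = registry.identifier_for("Apex Industries LLC")
print(first == second, first == other)
```

Exact matching on a normalized key only catches the easy cases. Two different organizations can share a key, and one organization can appear under names that no simple rule will reconcile, which is why the curator still has to make decisions.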
Thankfully, we are not the only ones who’ve had problems such as these. As long
as there have been databases, there have been database integrity problems. As we
started Googling around, we ran across field after field, specialization after
specialization, tool after tool that seek to redress every variation of the
above problem we could
imagine. Entity resolution,
record linkage,
householding and many other
academic fields were all created to address this issue. Background checks,
counterterrorism efforts and fraud analysis all depend on these techniques to
find the important data hiding in the mountains of messy data. The U.S. Census
Bureau has been using advanced statistical techniques for decades to make
sense of the data it collects. In short, as we researched these issues, we found
ourselves in interesting, varied and, frankly, unexpected company.
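As a rough illustration of what record linkage involves, the sketch below pairs names from two lists when their spellings are similar enough. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure. The `link_records` function and the 0.85 threshold are assumptions made for this example, and the statistical techniques used in the fields mentioned above are considerably more sophisticated than this.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(left, right, threshold=0.85):
    """Pair each name in `left` with its most similar name in `right`.

    A pair is kept only when the score reaches `threshold`
    (an arbitrary cutoff chosen for this sketch).
    """
    links = []
    for name in left:
        best = max(right, key=lambda cand: similarity(name, cand), default=None)
        if best is not None and similarity(name, best) >= threshold:
            links.append((name, best))
    return links


donors = ["John A. Smith", "Acme Widgets"]
lobbyists = ["John A Smith", "Jane Doe"]
print(link_records(donors, lobbyists))
```

Comparing every name against every other name becomes slow as datasets grow, and a single threshold will both miss true matches and accept false ones. Those two weaknesses are much of what the research in this area tries to address.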
What’s next
Although it will still be several months before we can point to projects where
we use these techniques, we’ve run across enough interesting ideas,
projects and efforts that we feel compelled to share some of the things we’ve
found. From talking with others in the open government community, we know that
others have felt our pain and are looking for their own solutions. Our solution
surely won’t be the same as everyone else’s, but the solutions will likely
share some common traits.
Over the summer, we’ll be blogging about research, companies and problems we’ve
come across in our work in entity resolution that we’ve found especially
interesting. The issues are necessarily technical, but we aim to keep the
explanations from being overly technical. We aim to build a lighthouse of ideas
for others trapped in the confusing fog of messy data. No one should have to
navigate the stormy seas of government data alone — and we hope that these posts
will help you find your way to wherever you are headed.
The Sunlight Foundation is a non-profit, nonpartisan organization that uses the power of the Internet to catalyze greater government openness and transparency, and provides new tools and resources for media and citizens, alike.
Source: http://sunlightfoundation.com/blog/2015/05/19/exploring-strategies-for-cleaning-messy-data/
