Our digital age is generating data at an unrelenting pace. Some estimates put it at more than one exabyte (1 billion gigabytes) per day, which is enough to fill the 1-terabyte hard disks of a million PCs.
This data comes from myriad sources, including financial transactions, medical records and scientific experiments and simulations — as well as the ubiquitous videos and selfies inundating the Internet.
The sheer volume of data can be overwhelming and frustrating. How many times have you looked for a particular photo on your smartphone, knowing it is in there somewhere, but been unable to find it among the hundreds or even thousands on the device? At work, have you ever stared at a spreadsheet in an effort to garner insight? And we’re all increasingly aware that our most personal data is vulnerable to theft and misuse.
The challenge of securely managing and analyzing vast amounts of data to extract information — and ultimately understanding — is among the greatest in computer science today. This is the so-called “big data” problem you may have read about.
At the Department of Energy’s Pacific Northwest National Laboratory, we are developing sophisticated mathematical techniques and software tools that address this challenge. This field is known as data analytics, and it is one of our core strengths, along with chemistry and environmental sciences.
By working in interdisciplinary teams, PNNL researchers are applying their world-leading capabilities to a variety of problems and end uses. For example, we employ graph theory and machine learning techniques to identify national security threats and better manage the electric grid. These capabilities also yield exciting discoveries in the biological, chemical and climate sciences.
We also lead the development and deployment of information visualization tools that help users “see” patterns in data quickly and easily. Some of these breakthrough capabilities, which sift through mountains of data and display analyses in ways that humans can interpret and explore, were developed more than a decade ago. Of course, there is still much more to do to keep up with the quantity and complexity of real-time data streams.
One of our internally funded initiatives focuses on inventing new techniques to automate the process of creating and validating hypotheses. The initiative is called Analysis in Motion, or AIM, and we are investing in it because we believe that the ability to make sense of data at larger volumes and faster speeds is foundational to many of PNNL’s programs. It also opens the door to data-driven discovery that is less dependent on human-centric and ad hoc manual processes. In all of these areas, we need to identify and interpret phenomena as they emerge and adapt our data collection and analysis methods as they evolve.
As a computational mathematician, I focused most of my career on running computer simulations that generated large amounts of data. It was not until I teamed with colleagues at the forefront of scientific visualization, however, that I could make sense of this data. This experience sold me on the importance of data analytics. Today’s experts are tackling problems we could not even comprehend when I was an active researcher.
PNNL computer scientists are also working on ways to protect valuable data and to secure the networks where it resides. We are making great strides in cyber security, which is a hot topic these days, and we do an increasing amount of basic research in this area. For example, we are collaborating with Washington State University to detect cyber attacks in complex computing networks. Our tool, called StreamWorks, represents the network as a large, dynamic graph and uses algorithms to query the graph and identify patterns of suspicious behavior.
The team tested the tool on two real-world data sets: an online news stream from The New York Times and Internet network traffic from the Center for Applied Internet Data Analysis. In both cases, it ran continuous queries on dynamic graphs at speeds up to 100 times faster than current methods that do not support incremental pattern matching.
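To give a feel for what incremental pattern matching on a dynamic graph means, here is a minimal Python sketch. It is not StreamWorks code. The class name `TwoHopMatcher`, the edge labels "login" and "upload", and the sample stream are all hypothetical, chosen only for illustration. The pattern is a simple two-step path, and the sketch ignores the order in which the two edges arrive. The point is that each new edge is checked only against stored partial matches, so the whole graph is never searched again.

```python
from collections import defaultdict


class TwoHopMatcher:
    """Hypothetical continuous query for paths a -[first]-> b -[second]-> c."""

    def __init__(self, first_label, second_label):
        self.first_label = first_label
        self.second_label = second_label
        # Partial matches, indexed by the shared middle node b.
        self.first_by_target = defaultdict(set)   # b -> {a}
        self.second_by_source = defaultdict(set)  # b -> {c}

    def add_edge(self, src, label, dst):
        """Insert one edge and return only the matches it newly completes."""
        new_matches = []
        if label == self.first_label and src not in self.first_by_target[dst]:
            self.first_by_target[dst].add(src)
            # Join the new first-step edge with stored second-step edges.
            new_matches += [(src, dst, c)
                            for c in sorted(self.second_by_source[dst])]
        if label == self.second_label and dst not in self.second_by_source[src]:
            self.second_by_source[src].add(dst)
            # Join the new second-step edge with stored first-step edges.
            new_matches += [(a, src, dst)
                            for a in sorted(self.first_by_target[src])]
        return new_matches


# Made-up edge stream: (source, label, destination).
stream = [
    ("alice", "login", "host1"),
    ("host1", "upload", "ext"),
    ("bob", "login", "host1"),
    ("host1", "upload", "ext"),  # duplicate edge, completes nothing new
]

matcher = TwoHopMatcher("login", "upload")
results = [matcher.add_edge(*edge) for edge in stream]
for edge, found in zip(stream, results):
    print(edge, "->", found)
```

A non-incremental approach would rescan every stored edge each time the graph changes. Here the work per update depends only on the edges that touch the new edge's endpoints, which is the general reason incremental methods can be much faster on fast-moving streams.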
Advancements like this take us one step closer to detecting and deterring adversarial actions in real time. For 10 years, PNNL has been a leader in cyber security capabilities that analyze and protect DOE networks. As the leader of the DOE Cooperative Protection Program and Cyber Intelligence Center, we are responsible for research on next-generation cyber security sensors, standards for secure communication protocols and operational analysis of the integrity and security of cyber networks.
Looking to the future, the need for data analytics and cyber security will only grow.
Working with DOE, other sponsors and collaborators, we will continue to deliver essential tools for uncovering new knowledge from scientific research and protecting critical information. And in so doing, we will help make “big data” a bit less intimidating.