Random Walk
A Universe of Astronomical Data
Long gone are the days of the lone astronomer perched atop a mountain, peering through an eyepiece at a smudge in the night sky. Modern astronomers, though, still have to go out and get their own data, either spending nights in the telescope’s control room or observing remotely.
But for the astronomer of 2020, data will automatically come to her before she wakes up in the morning. While she sleeps, telescopes on autopilot will comb the night sky, feeding many terabytes of data (a terabyte is a thousand billion bytes) into computers. The computers will then mine the data, flagging any intriguing cosmic events, such as new supernovae or explosive gamma-ray bursts, for our astronomer to peruse while sipping her morning brew. Computers will also have collected information from other databases that may be relevant, such as multiwavelength images of that part of the sky.
Later, she might log onto some three-dimensional virtual environment, where virtual representations of herself and astronomers from around the globe will literally immerse themselves in the data—a colorful three-dimensional cluster of spheres that represent the hundreds of characteristics that describe, for instance, a supernova—analyzing the supernova in ways that would have been inconceivable a mere 10 years earlier.
Welcome to astroinformatics—an amalgam of astronomy and computer science. Increasingly powerful instruments, more sensitive telescopes, and ever-faster computers are inundating the astronomical community with far more data than it ever imagined. The next generation of sky surveys is expected to uncover billions of stars and galaxies and lots of other interesting things, such as supernovae, quasars, exoplanets, and possibly new objects that have yet to be discovered—each with hundreds of parameters.
The Large Synoptic Survey Telescope project, of which Caltech is a partner institution, will produce petabytes of data, roughly the amount of information contained in a billion books. Another project planned for the future, the Square Kilometre Array, is expected to yield exabytes of data (an exabyte is a billion billion bytes). By comparison, Caltech’s been involved with numerous surveys, such as the Digitized Palomar Sky Survey, Palomar-Quest, the Palomar Transient Factory, and the Two-Micron All Sky Survey, that have generated tens of terabytes of data, a factor of at least a thousand less than what future surveys will produce. And the size of data will follow Moore’s Law, doubling every couple of years.
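To get a feel for these scales, here is a minimal Python sketch of the arithmetic. The starting size of 10 terabytes, the target of 10 petabytes, and the two-year doubling period are illustrative assumptions chosen to match the rough figures above, not numbers taken from any survey.

```python
# Decimal (SI) byte units as used in the text.
TERABYTE = 10**12  # a thousand billion bytes
PETABYTE = 10**15
EXABYTE = 10**18   # a billion billion bytes


def doublings_needed(start_bytes, target_bytes):
    """Count how many doublings take start_bytes to at least target_bytes."""
    count = 0
    size = start_bytes
    while size < target_bytes:
        size *= 2
        count += 1
    return count


def years_to_reach(start_bytes, target_bytes, doubling_period_years=2):
    """Years of growth needed, assuming one doubling per period (an assumption)."""
    return doublings_needed(start_bytes, target_bytes) * doubling_period_years


# Hypothetical example: a 10-terabyte survey growing to 10 petabytes.
print(doublings_needed(10 * TERABYTE, 10 * PETABYTE))
print(years_to_reach(10 * TERABYTE, 10 * PETABYTE))
```

A factor of a thousand sounds enormous, but it is only about ten doublings, which is why steady exponential growth closes the gap within a couple of decades.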
It’s not just that the databases will be huge. They’ll be complex: imagine trying to visualize, much less understand, hundreds of parameters all at once. Sifting through the numbers is humanly impossible, so Caltech astronomers are taking the lead in developing tools to process, analyze, and understand this deluge of information.
For example, the so-called Virtual Observatory (VO), which Caltech has been a leader in developing since its inception 10 years ago, is a way to integrate all the data that’s being collected by telescopes from around the world and in orbit. Every telescope’s data sets are different, not only in terms of what information is gathered—Chandra is a space telescope that looks at X-rays, and Keck is an optical scope on top of Mauna Kea, for example—but also in format and how they’re accessed. But with the VO, an astronomer can enter a query on the computer and get to all the relevant databases at once, without having to learn the quirks and technicalities that accompany each data set. “Federating is the word that’s normally used for this,” says Matthew Graham, a computational scientist at Caltech’s Center for Advanced Computing Research (CACR). “You’re federating these different data sets and then using online services to do things with them.”
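The idea of federating can be illustrated with a toy Python sketch. Everything here is hypothetical: the two tiny archives, their field names, and the `federated_box_search` function are made up for illustration and are not the VO's actual interface. The point is only that each archive gets a small adapter that translates its own format into a shared one, so a single query can run across all of them.

```python
# Two hypothetical archives that describe sources with different
# field names and units (one stores right ascension in hours).
XRAY_ARCHIVE = [
    {"src": "X-1", "ra_deg": 150.10, "dec_deg": 2.20},
    {"src": "X-2", "ra_deg": 10.00, "dec_deg": -5.00},
]
OPTICAL_ARCHIVE = [
    {"id": "O-7", "ra_hours": 10.01, "dec": 2.25},
    {"id": "O-9", "ra_hours": 20.00, "dec": 40.00},
]


def from_xray(row):
    """Adapter: convert an X-ray archive row to the shared record format."""
    return {"name": row["src"], "ra_deg": row["ra_deg"],
            "dec_deg": row["dec_deg"], "archive": "xray"}


def from_optical(row):
    """Adapter: convert an optical archive row (1 hour of RA = 15 degrees)."""
    return {"name": row["id"], "ra_deg": row["ra_hours"] * 15.0,
            "dec_deg": row["dec"], "archive": "optical"}


def federated_box_search(ra_min, ra_max, dec_min, dec_max):
    """Return sources from every archive inside a simple sky box.

    This toy version ignores the wraparound of right ascension at 360 degrees.
    """
    records = ([from_xray(r) for r in XRAY_ARCHIVE]
               + [from_optical(r) for r in OPTICAL_ARCHIVE])
    return [r for r in records
            if ra_min <= r["ra_deg"] <= ra_max
            and dec_min <= r["dec_deg"] <= dec_max]


matches = federated_box_search(150.0, 150.5, 2.0, 2.5)
print([(m["name"], m["archive"]) for m in matches])
```

The person asking the question never sees the hours-versus-degrees mismatch; the adapters absorb it.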
After a decade of developing the tools and infrastructure needed to get these databases to talk to each other, the project, now called the Virtual Astronomical Observatory and funded by NASA and the NSF, opened for business in May. “We’re moving onto the operational phase,” says Graham, a member of the program council of the VAO. “The hope is that we can really make an impact on the community.” In addition to Graham, CACR computational scientist Roy Williams (PhD ’83) also plays a leading role with the VAO. Others at Caltech who are involved with astroinformatics include CACR executive director Mark Stalzer, CACR computational scientist Andrew Drake, executive director of the Infrared Processing and Analysis Center (IPAC) George Helou, IPAC scientists Joe Mazzarella and Bruce Berriman, postdoc Ciro Donalek, staff scientist Ashish Mahabal, and others.
But there’s more to astroinformatics than just bigger telescopes, better computers, and more sophisticated software. “It’s not just the same old stuff with more data, but genuinely new things,” says George Djorgovski, professor of astronomy and principal investigator for Caltech’s part of the VAO consortium. “We’ll be able to ask questions that we couldn’t dream of asking before, just because we didn’t have the tools or the data.” This past June, Djorgovski was one of the organizers of the first astroinformatics conference, held at Caltech’s Cahill Center for Astronomy and Astrophysics and attracting about a hundred astronomers from around the world. Participants spent four days discussing a wide array of topics, ranging from data mining and computation to education and outreach. The conference was even broadcast live on the Web, and participants posted comments on Twitter during the talks.
The outreach goes far beyond education, as astronomers are actually soliciting the public’s help. You’re probably familiar with SETI@home, a program that uses your computer’s downtime to search radio-telescope data for signs of extraterrestrial life. Astronomers want to take advantage of your brain, as well as your laptop. Galaxy Zoo, for example, is a website that asks users to classify thousands of individual galaxies from the Sloan Survey. Distinguishing a spiral galaxy from an elliptical one is a complex problem for a computer, but a simple one for a human. With more than 250,000 users, Galaxy Zoo has spawned similar projects to help astronomers sift through data taken by other missions, like the Lunar Reconnaissance Orbiter and the Hubble Space Telescope.
But even turning the entire world into an astronomy sweatshop will fall short. “There aren’t enough humans on the planet to handle the data right now,” Graham says. So researchers like him want to take this idea of “citizen astronomy” further and figure out how citizen scientists interpret data to develop smarter data-mining algorithms. For instance, a human can identify a bright light in the spiral arm of a galaxy as a supernova, because we know that an exploding star has to live in a galaxy. Understanding this kind of contextual information is hard for a computer, but if machine-learning researchers analyze enough images in which supernovae have been spotted by humans, then some other characteristics that a computer can process may be uncovered.
This sort of technique, part of a subfield called semantic astronomy, is still in its early stages, Graham says. But a lot of astroinformatics will involve similar tools that turn the computer from a number-crunching machine into an intelligent assistant. Instead of having to mine through different databases and pick out the relevant numbers by hand, an astronomer could just type in a query in plain English—for example, “find all the data on stars within 100 light-years of us”—and the computer would cull all the relevant information from every database available, leaving the astronomer free to focus on the science.
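Behind a plain-English request like the one above sits a fairly simple calculation. As a rough Python sketch, assume a catalog that lists each star's parallax in milliarcseconds; the distance in parsecs is then 1000 divided by the parallax, and one parsec is about 3.26 light-years. The three-star catalog below uses approximate, rounded parallaxes purely for illustration, and the function names are hypothetical.

```python
LIGHT_YEARS_PER_PARSEC = 3.26156

# Approximate, rounded parallaxes in milliarcseconds (illustrative only).
CATALOG = [
    ("Sirius", 379.0),
    ("Vega", 130.0),
    ("Polaris", 7.5),
]


def distance_light_years(parallax_mas):
    """Convert a parallax in milliarcseconds to a distance in light-years."""
    distance_pc = 1000.0 / parallax_mas
    return distance_pc * LIGHT_YEARS_PER_PARSEC


def stars_within(catalog, max_light_years):
    """Return names of stars no farther away than max_light_years."""
    return [name for name, parallax_mas in catalog
            if parallax_mas > 0
            and distance_light_years(parallax_mas) <= max_light_years]


print(stars_within(CATALOG, 100))
```

The hard part of an intelligent assistant is translating the English sentence into this kind of filter and knowing which databases hold the parallaxes, not the filter itself.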
Of course, astronomy is far from being the only field overwhelmed with information. The burgeoning field of bioinformatics has been transforming biology for the past decade. Other sciences are facing similar challenges: real-time sensors monitoring everything from earthquakes to climate are generating a barrage of data, and Moore’s Law is driving an exponential growth in information. “Any science that’s using semiconductors to do detection is suddenly becoming data-intensive,” says CACR’s Mark Stalzer.
“Science in the 21st century is going to be different,” Djorgovski adds. “The focus is shifting from having better hardware to having better software and methodology. It’s going from atoms to bits to knowledge.” With the smart phones, social networking, and news feeds that inundate our everyday lives, we’re all experiencing a torrent of information. And like the rest of us, astronomers are learning to deal with it. —MW
If you want to learn more about astroinformatics, you can find slides and videos of all the talks from the Astroinformatics 2010 conference at www.astroinformatics2010.org.

This image from Second Life shows two avatars immersed in a virtual world of three-dimensional data.

