Controversial Co-Dependents: Data Mining and Information Sharing

To death and taxes we might add an early 21st-century absolute: about once a year, a firestorm will ignite around the use of data mining by government agencies.

The most recent came as part of the media dust-up over warrantless searches, or so-called “domestic spying,” and stirred controversy over how the National Security Agency used phone logs, among other related matters.

Similar storms of recent vintage have rattled a program for airport passenger pre-screening, a platform for interoperability across state police systems, and the Pentagon’s ambitious plan to link its search for terrorists to many if not most of the world’s databases.

All of these occasions incited privacy and civil liberties attacks that made data mining dirty words in the press and blogosphere; and whenever that happens, federal IT programs tend to get hurt.

It would probably surprise the average American that most data mining in government is beyond controversy, occurring largely within resources developed by or for government and requiring no forays across the Fourth Amendment’s evolving and murky digital boundary.

The coherent querying of an agency’s own records is often described as part of the process of making sure it “knows what it already knows,” as the Defense secretary and Joint Chiefs like to say.

An example of such a system was implemented two years ago by the Federal Aviation Administration. It essentially polls information from the FAA’s own network of deployed IT security systems, giving analysts and technicians a strategic platform for staying ahead of vulnerabilities and a tactical resource for responding to cyber incidents.

The various flash points of criticism that data mining attracts inevitably follow the use of what are sometimes called “open source” databases. Such sources might include credit reports, phone company logs, bank records, and other transactional trails individuals commonly generate as part of their daily lives.

Although one insider terms it “a relatively young technology still finding its way,” there is little disputing data mining’s efficacy in advancing the broader cause of information sharing in government. An implicit acceptance of data mining is encapsulated in the Patriot Act, the 9/11 Commission’s recommendations, and the Intelligence Reform Act.

Reliance on data mining is embedded in the processes of more than a dozen information sharing and analysis centers (ISACs) for better infrastructure protection at home, and even extends across national boundaries through programs like the Container Security and Proliferation Security initiatives, with links built between the U.S. and, in some cases, more than 60 other nations.

Domestic programs like the national “Amber Alert” system also draw on data mining-supported information sharing built into support systems like Global Justice XML and other standard sharing models. The relationship between mining and sharing is often fundamental. Speaking generally of the intelligence community, Randy Ridley, vice president and general manager of MetaCarta, a geographic systems provider to the intelligence agencies, other agencies and industry, said, “Without data mining there would be nothing to share.”

The need for data mining is generally dated back to the improvement in database management systems in the early 1990s, when massive data warehouses for government and industry were first being built. The build-up of repositories capturing object-oriented data, as well as burgeoning relational systems and legacy flat files, increasingly demanded a coherent way for users to readily obtain mission-specific information.

If anything, the need for point-specific data mining has increased exponentially as multimedia has flooded the Internet in the “Google” era. For example, in the wake of the July 7, 2005, terror attacks in London, British law enforcement agencies gathered and perused about 200,000 hours of videotape as part of their investigation of the attacks.

The ability to query remains the root device in the data mining toolkit, but today it is parceled into a number of application-level disciplines. Systems for link analysis, semantic searching and other refinements add “granularity” to tasks where “less Search and more Find” is the ultimate goal, as one expert noted.

As early as 1994, the intelligence community launched its Intelink intranet, one of the first federal platforms in which information sharing and data mining were implemented in a single mission: the production of intelligence reports for the White House and the military.

Today, the data mining element of Intelink is beefed up with applications like MetaCarta’s geographical text search (GTS) mechanism, used when analysts want their information organized geographically, around the longitude and latitude of the subject in question, whether the medium is email, a web page, a news bulletin, a cable or another source.

“The results are displayed on a map with icons representing the locations found in the natural language text of the documents,” Ridley said. The GTS is just one component in one of the systems used to fulfill the goal of intelligence reform, and is driven by “sub-second” searches across millions of indexed documents residing on top secret networks.
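As a rough illustration of the idea behind such a geographic text search, the sketch below tags documents with coordinates of place names found in their text and then retrieves them by bounding box so hits could be dropped onto a map. The gazetteer, document set and function names are invented for illustration; they are not MetaCarta’s actual data or API.

```python
# Minimal sketch of geographically organized search: tag documents with the
# coordinates of place names found in their text, then retrieve them by
# bounding box so hits can be plotted on a map. All names and data here are
# illustrative placeholders.

GAZETTEER = {                      # toy place-name -> (latitude, longitude)
    "Baghdad": (33.3, 44.4),
    "Kabul": (34.5, 69.2),
    "London": (51.5, -0.1),
}

def geo_tag(text):
    """Return (place, lat, lon) for every gazetteer place mentioned in the text."""
    return [(place, lat, lon)
            for place, (lat, lon) in GAZETTEER.items()
            if place in text]

def build_index(documents):
    """Map each document id to the coordinates extracted from its text."""
    return {doc_id: geo_tag(text) for doc_id, text in documents.items()}

def query_bbox(index, lat_min, lat_max, lon_min, lon_max):
    """Return ids of documents with at least one coordinate inside the box."""
    return [doc_id for doc_id, places in index.items()
            if any(lat_min <= lat <= lat_max and lon_min <= lon <= lon_max
                   for _, lat, lon in places)]

docs = {
    "cable-001": "Sources in Baghdad report increased activity.",
    "news-002": "A briefing in London covered regional logistics.",
}
print(query_bbox(build_index(docs), 30, 40, 40, 50))   # -> ['cable-001']
```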

The baseline belief that disparate intelligence agencies can be coherently integrated probably would have had little traction in Congress or among the 9/11 Commission, which so actively advocated reform, but for a companion belief in data mining. When, last year, the Defense Science Board urged DOD to improve its analysis of open source intelligence, there was likewise a de facto acceptance of the abilities of data mining built into the recommendation.

Government now has a growing body of information sharing experiences to draw on, including formal Information Sharing and Analysis Centers, the National Counterterrorism Center, multiple links built by U.S. Northern Command to and from law enforcement and first responders, and other activities and events.

In almost all these cases, data mining and similar systems support the tagging or cataloging of data, or the budding “write to release” approach to sharing information. Systems once hard-wired to “walled” information are increasingly being expanded to support a process termed “post and analyze,” in which gleaned information is immediately shared rather than hoarded.

The need for both new data mining applications and an increasingly robust infrastructure to run them has turned data mining into a “disruptive and enabling” technology, said Anne Wheeler, a former IBM senior engineer who (along with her husband, Lynn Wheeler) helped build data mining utilities for government and industry, including an open source semantic search utility called Dynasty.

Wheeler said that networking architectures and storage and backup systems must increasingly account for the reality that queries will be many and varied, and that users will want very exacting applications, or what Randy Ridley of GIS specialist MetaCarta calls “cohesive and timely personalized information delivery tools.”

The need to deal with both “structured and unstructured data, or to go right to source data and be able to use that” is what drove organizations to move data mining from the desktop to the enterprise level, Ridley said. The GTS, for instance, as used in the intelligence community, is built into a web-based, services-oriented design in which IT shops maintain it as just another appliance in the enterprise for analysts. The system also depends heavily on integrated document management tools.

Increasingly, data mining has been focused on possible real-time uses in which operational data can be turned into the stuff of decision-making as events unfold. A recent upgrade in British law enforcement created a Violent and Sex Offenders Registry that operates in real time, with data capture and information posting accomplished all at once. Many governmental activities, however, are not ready to mine operational data, either organizationally or technologically.

David Carrick, managing director of Memex, a company that specializes in information sharing for law enforcement, recently noted that “very few [police] applications … have been developed with the ability to be interrogated with ad hoc queries during day-to-day investigations.”

The ideal that all data “contained within your existing operational system should be fully searchable from a single application,” as Carrick put it, has been a long time coming in some agencies, and might still be a long time coming in others.

In some cases, data mining lost five years awaiting data standardization programs that still have not reached fruition. In the interim, builders of data mining tools developed unstructured systems that accommodate discordant data types and even disparities as fine-grained as the spelling of proper names. Wheeler noted that advanced systems look at all data, including metadata, equally to ferret out meaningful results.

Assessments of the value of data mining date back to the oft-cited example of a supermarket chain that, in the 1990s, determined that putting baby diapers near the cold case in its stores would increase beer sales in evening hours, when young fathers are sent out to buy diapers. The process used was fundamentally link analysis, in which an invisible set of relationships rooted in the data was extracted through mining and then put to use as part of a strategy.
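The arithmetic behind that kind of link analysis is easy to sketch. The toy example below, with invented transactions, counts how often items co-occur and derives the support and confidence of a candidate rule such as “diapers imply beer”; it illustrates the technique generically rather than any retailer’s actual system.

```python
# Toy association-rule arithmetic behind the diapers-and-beer anecdote:
# count item co-occurrences across transactions, then compute support and
# confidence for a candidate rule. Transactions are invented for illustration.
from collections import Counter
from itertools import combinations

transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "wipes"},
    {"beer", "chips"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(frozenset(p) for p in combinations(basket, 2))

def rule_stats(antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    together = pair_counts[frozenset((antecedent, consequent))]
    support = together / len(transactions)            # share of all baskets
    confidence = together / item_counts[antecedent]   # P(consequent | antecedent)
    return support, confidence

print(rule_stats("diapers", "beer"))   # -> (0.4, 0.666...)
```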

Semantic or “natural language” approaches to data mining increasingly account for the predilections of the user first and make associative connections rather than one-to-one links across tables of data. Wheeler described the legacy approach to data mining as “trying to write the history of the world in spreadsheet form.”

Where associations are complex, older systems restricted to structured data might either break down or miss the information most vital to the user. Older systems also cannot accommodate the billions of documents that enterprise systems are specified to deal with, Ridley said.

“Data mining systems are increasingly geared toward large production systems running multiple tools,” he said. “The challenge in the intelligence community is that analysts tend to want a single button that says ‘Answer’ on it that they can press.”

Short of a silver bullet, data mining “entity extractors” are used. These are software tools that scan unstructured text, identify what should be marked up with XML, and then send it to structured systems from which it can be queried, as the GTS “geo-tagger” system can do “with millions of documents per day” if the mission demands it, Ridley said.
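A toy sketch of that extraction step, with an invented entity list and tag names (not the GTS product’s), shows the basic flow: scan unstructured text, wrap recognized entities in XML-style tags, and emit structured records a downstream system could index and query.

```python
# Toy entity extractor: scan unstructured text, wrap recognized entities in
# XML-style tags, and emit structured records that a database could index.
# The entity lists and tag names are invented, not the GTS scheme.

ENTITIES = {
    "PLACE": ["Baghdad", "London"],
    "ORG": ["Scotland Yard"],
}

def extract(text):
    """Return (tagged_text, records) for the known entities found in the text."""
    records = []
    tagged = text
    for etype, names in ENTITIES.items():
        for name in names:
            if name in tagged:
                tagged = tagged.replace(name, f"<{etype}>{name}</{etype}>")
                records.append({"type": etype, "value": name})
    return tagged, records

tagged, records = extract("Scotland Yard analysts reviewed reports from London.")
print(tagged)    # <ORG>Scotland Yard</ORG> ... reports from <PLACE>London</PLACE>.
print(records)   # [{'type': 'PLACE', 'value': 'London'}, {'type': 'ORG', ...}]
```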

Entity extractors can be built around user needs, whether for geographic information or for tracking people, events, organizations, specific trends, general patterns, and so forth. “From an architecture standpoint, it’s very attractive, very uniform.” Such systems can also be benchmarked by potential users against an “F-measure,” a standard gauge of the precision and completeness of searches, Ridley said.
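In standard information-retrieval terms, the F-measure is the harmonic mean of a search’s precision and recall. A minimal illustration of that arithmetic, using invented document counts, follows.

```python
# Standard F-measure (harmonic mean of precision and recall) computed from
# raw retrieval counts; the counts below are invented for illustration.

def f_measure(true_pos, false_pos, false_neg, beta=1.0):
    """F-beta score; beta=1 weights precision and recall equally."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A search that returns 80 relevant documents and 20 irrelevant ones while
# missing 40 relevant documents scores:
print(round(f_measure(80, 20, 40), 3))   # -> 0.727
```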

The vision of the 9/11 Commission was that data mining be plugged into information sharing platforms to create a real-time mechanism for generating “actionable” intelligence that might run from the CIA overseas all the way to a police officer walking a beat, thus employing “all the tools in the toolkit,” as one commissioner put it.

For that to happen, however, Anne Wheeler said, agencies will have to migrate data mining from an activity often confined to transient computer memory toward stronger use of long-term storage, where not only the data but its meaningful relationships can be preserved in the original format.

“Too much data is still being force-fit into tables” (columns and rows), limiting the completeness of searches, said Lynn Wheeler, chief scientist at First Data Corp. and developer of the Dynasty HTML system. He said that while infrastructure costs might rise when data structure is permanently stored and “persistently mapped,” costs are correspondingly reduced on the operational side. Fewer IT staff are required when “domain experts can build their own databases without having to get their information re-organized first” or await “a centralized organizational effort,” he said.

Regardless of the exact technology used, evidence indicates that as agencies better leverage source data the goal of information sharing is enhanced. But, as we have seen, the more agencies use source data originating even a step outside their own repositories, the more likely they are to encounter program-stopping controversy. In some cases, the best technology might be “too good” for government.

Addendum:

Data Mining: Trap Doors Await

If the brief history of data mining is already a two-part saga, Part Two began in the immediate aftermath of the 9/11 terror attacks on America.

Rather than gaining momentum, efforts such as Defense’s Terrorism Information Awareness program, the Justice-initiated MATRIX system for identifying potential terrorists on integrated state police systems, and the Homeland Security Department’s Computer Assisted Passenger Pre-Screening-2 system for airline travel were eliminated or put on hold in response to privacy and civil liberties objections. Similar objections had been leveled pre-9/11 at the FBI’s former Carnivore system and, more recently, at the National Security Agency amid media charges of domestic spying.

Objections to data mining generally follow the argument that government should restrict itself to its own data and stay away from open sources. But in the same period that momentum has run against data mining, the movement in favor of information sharing promoted by the 9/11 Commission and intelligence reform has gained ground, while simultaneously recommending that government better leverage open sources of information.

Thus, we are left unraveling a sequence in which, for instance, in 2002 more than a dozen information sharing and analysis centers were formed to better protect critical infrastructure including telecommunications, but by late 2005 it was the stuff of scandal-whisperers that the telecom industry was sharing security-related information with government as part of the hunt for terrorists. Some controversies only find logic inside the beltway.

Trap doors await those who will continue to advocate data mining-based information sharing, if recent history applies. In the same period that TIA, the MATRIX and CAPPS-2 collapsed under political criticism, Congress moved about 400 new privacy laws onto the books, each with its own set of definitions that might or might not apply to open sources of information as systems evolve that can leverage them.

As well, the debate about limiting the Patriot Act has occurred with nary a word about how many recently built sharing systems and/or links built across systems would have to be unplugged when or if the law is altered, and whether changes would, perhaps, even impact the legality of the Homeland Security department itself, which was formed on the platform the Act provided.

The winds also moan with calls at one end to declassify more information and thus increase sharing organically across government agencies, and from other quarters for a relaxation of Freedom of Information Act restrictions, which might also open more information to search/find mechanisms.

Is it ironic that some of the same interests and organizations that want the government to use less open source data want more government data made generally available? Maybe not. Information sharing is sometimes “quipped” rather than defined as a process in which, “You give me your information, and I give you nothing.”
