“Black Holes” in Enterprise Information

April 27, 2008

Yesterday–trapped once again in the nausea-inducing SEATAC Airport–I talked on the telephone with a caller concerned about problem areas in enterprise information. The issue, as I understood her comments, had to do with launching a search and retrieval system’s crawler or content acquisition “bot” and then running queries to see what was on publicly-accessible folders and servers within the organization.

My comment to her was, “You may want to perform a content inventory, do some testing of access controls, and do some very narrowly focused tests.”

Her response was one I hear frequently from 30-somethings, children of the approval culture: “Why?” These wonderful people have grown up with gold stars on lousy book reports, “You’re a Champ” T-shirts for miserable under-10 soccer efforts, and kisses upon graduating from university with a gentle person’s “C”.

I did what I could to flash the yellow caution signal, but this call, like so many others I get, was a “tell me what I want to hear” inquiry, not a “real world” information buzz. The caller wanted me to say, “What a great idea!” Sorry. I’m the wrong guy for that cheerleading.

A Partial List of Black Holes

Here is my preliminary list of enterprise information “black holes”. A black hole is not well understood. My weak intellect thinks that a black hole is a giant whirlpool with radiation, crushing gravity, and the destruction of chubby old atoms such as the ones that make me the doddering fool I am. To wit:

School, religious, bake sale, and Girl Scout information in email and any other file formats, including Excel, image files, and applications that send email blasts
MP3 and other rich media files that are copyrighted, pornographic, or in some way offensive to management, colleagues, or attorneys. This includes vacation photos of overweight relatives and spouses wearing funny hats.
Information in email or other formats pertaining to employee compensation, health, job performance, or behavior. Think discovery. Think deposition. Think trial.
Any data that is at variance with other information vetted and filed at a regulatory body; for example, marked-up copies of departmental budgets, laboratory reports, clinical trial data, agreements between a vendor and a manager, and similar “working draft” information. Think how you and your colleagues would look on the six o’clock news in orange jumpsuits.
Software installed or copied to a hard drive that is hacked, borrowed, or acquired from an online source not known to be free from spyware, backdoors, keyloggers, and trojans. Think big, big fine.
Information about defeating firewall filters or other security workarounds needed to allow access to Web sites, information, or services that are not permitted by the firm’s security officer or by agreements between the firm and a law enforcement or intelligence entity. Think losing lucrative pork barrel goodies.
Information germane to a legal action that has not been provided to the firm’s legal counsel, regardless of the information holder’s role in the company or that person’s knowledge of the legal matter that makes the information pertinent. Think of hours with attorneys. Uggh. Typing the words makes me queasy.
Email threads discussing behaviors of employees and their dealings with vendors, co-workers, business partners, and consultants in which non-work-related topics are discussed. Think Tyco, Enron, and other business school case studies about corporate ethics.

Do you have examples of other “black holes”?

In the run-up to the release of the index of the US Federal government’s public-facing Web sites, I recall sitting in a meeting to discuss the test queries we were running in the summer of 2000. My own queries surfaced some interesting information. I stumbled upon a document that, when opened in an editor, carried a clear statement that the document was not to be made public. The document was removed from the crawl and its index pointer deleted. My recollection is hazy, but the test queries surfaced a great deal of information that I certainly did not expect to be sitting on a publicly-accessible server.

I’ve learned that test crawls that suck information into a search system almost always yield some excitement, to a greater or lesser degree. The young, hip, enthusiastic search engine managers don’t realize the potential downside of indiscriminate “test indexing”.

Tips on How to Avoid a Black Hole

Here are my suggestions for avoiding self-destruction in an information “black hole”:

Do a thorough content inventory, define a narrow test crawl, and expand the crawl on a schedule that allows time to run test queries and to remove or change the access flag on problematic information.
Coordinate with your firm’s security and legal professionals. If you don’t have these types of employees sitting in their offices eager to help you, hire a consultant to work with you.
Run exhaustive test queries *before* you make the search system available to the users. An alpha test followed by a slightly more expansive beta test is a useful pre-release tactic.
Inform your co-workers about the indexing or “indexation” process so they have time to expunge the grade school’s bake sale promotional literature, budget, and email list from the folder the spider will visit.
To avoid surprises, inform management that if problematic information turns up, the search system may be offline while the problem is rectified.
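
The first tip, a content inventory ahead of a narrow test crawl, can be roughed out in a few lines of Python. This is only a sketch under stated assumptions, not a substitute for a review by your security and legal people. The names `RISKY_EXTENSIONS`, `RISKY_NAME_PATTERNS`, `inventory`, and `split_for_crawl` are hypothetical, and the extensions and keywords are illustrative guesses drawn from the list of black holes above. The script looks only at file names, not at file contents or access controls.

```python
import os
import re
from pathlib import Path

# Illustrative assumptions only: tune these to your own organization.
RISKY_EXTENSIONS = {".mp3", ".pst", ".exe"}
RISKY_NAME_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"salary",
        r"compensation",
        r"draft",
        r"bake[ _-]?sale",
        r"confidential",
    )
]


def inventory(root):
    """Walk root and return a list of (path, reasons) for every file.

    reasons is an empty list when no rule matched the file name.
    """
    results = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = Path(dirpath) / name
            reasons = []
            if path.suffix.lower() in RISKY_EXTENSIONS:
                reasons.append("extension " + path.suffix.lower())
            for pattern in RISKY_NAME_PATTERNS:
                if pattern.search(name):
                    reasons.append("name matches " + pattern.pattern)
            results.append((path, reasons))
    return results


def split_for_crawl(results):
    """Separate files with no matches from files a person should review."""
    clean = [path for path, reasons in results if not reasons]
    flagged = [(path, reasons) for path, reasons in results if reasons]
    return clean, flagged


if __name__ == "__main__":
    # Hypothetical starting point: one small shared folder, not the whole network.
    clean, flagged = split_for_crawl(inventory("."))
    print(len(clean), "files with no name-based flags")
    for path, reasons in flagged:
        print("REVIEW:", path, "->", "; ".join(reasons))
```

Point it at one small folder first, review the flagged files with the appropriate people, and widen the scope only after the test queries come back clean. A file that passes this name check can still hold problematic content, so the exhaustive test queries remain necessary.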

I will let you know if she calls me back.

Stephen Arnold, April 27, 2008

Written by Stephen E. Arnold · Filed Under Enterprise, Feature, Search


Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.