Google Autocomplete: Is Smart Help a Hindrance?

September 10, 2012

You may have heard of the deep extraction company Attensity. There is another company in a similar business with the name inTTENSITY. Note the playful misspelling of the common word “intensity.” What does a person looking for the company inTTENSITY get when he or she runs a query on Google? Look at what Google’s autocomplete suggestions recommend when I type intten:

[Image: Google autocomplete suggestions for the query “intten”]

The company’s spelling appears along with the less helpful “interstate ten”, “internet explorer ten”, and “internet icon top ten.” If I enter “inten”, I don’t get the company name. No surprise.

[Image: Google autocomplete suggestions for the query “inten”]
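
For readers who want to poke at this themselves, here is a minimal Python sketch that pulls suggestions from Google’s suggest endpoint. The suggestqueries.google.com URL is an unofficial, undocumented interface that may change or vanish, so treat the snippet as an illustration rather than a supported API.

    # Minimal sketch: pull Google's autocomplete suggestions for a prefix.
    # The suggestqueries.google.com endpoint is unofficial and may change or
    # disappear without notice; this illustrates the idea, nothing more.
    import json
    import urllib.parse
    import urllib.request

    def autocomplete(prefix):
        url = ("https://suggestqueries.google.com/complete/search"
               "?client=firefox&q=" + urllib.parse.quote(prefix))
        with urllib.request.urlopen(url) as response:
            payload = json.loads(response.read().decode("utf-8", errors="replace"))
        # Response shape: [query, [suggestion, suggestion, ...]]
        return payload[1]

    for prefix in ("intten", "inten"):
        print(prefix, "->", autocomplete(prefix))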

Is Google’s autocomplete a help or a hindrance? The answer, in my opinion, is that it depends on the user and what he or she is seeking.

I just read “Germany’s Former First Lady Sues Google For Defamation Over Autocomplete Suggestions.” According to the write up:

When you search for “Bettina Wulff” on Google, the search engine will happily autocomplete this search with terms like “escort” and “prostitute.” That’s obviously not something you would like to be associated with your name, so the wife of former German president Christian Wulff has now, according to Germany’s Süddeutschen Zeitung, decided to sue Google for defamation. The reason why these terms appear in Google’s autocomplete is that there have been persistent rumors that Wulff worked for an escort service before she met her husband. Wulff categorically denies that this is true.

The article explains that autocomplete has been the target of criticism before. The concluding statement struck me as interesting:

In Japan, a man recently filed a suit against Google after the autocomplete feature started linking his names with a number of crimes he says he wasn’t involved in. A court in Japan then ordered Google to delete these terms from autocomplete. Google also lost a similar suit in Italy in 2011.

I have commented about the interesting situations predictive algorithms can create. I assume that Google’s numerical recipes chug along like a digital and intent-free robot.

Read more

More Content Processing Brand Confusion

September 7, 2012

On a call with a so-so investment outfit once spawned from JP Morgan’s empire, the whiz kids on the call with me asked me to name some interesting companies I was monitoring. I spit out two or three. One name created a hiatus. The spiffy young MBA asked me, “Are you tracking a pump company?”

I realized that when one names search and content processing firms, the name of the company and its brand are important. I was referring to an outfit called “Centrifuge”, a firm along with dozens if not hundreds of others in the pursuit of the big data rainbow. The company has an interesting product, and you can read about the firm at www.centrifugesystems.com.

Now the confusion. Google thinks Centrifuge business intelligence is the same as centrifuge coolant sludge systems. Interesting.

[Image: relationship detail]

There is a pump and valve outfit called Centrifuge at www.centrisys.us. This outfit, it turns out, has a heck of a marketing program. On YouTube, a search for “centrifuge systems” returns a raft of information timber about viscosity, manganese phosphate, and lead dust slurry.

I have commented on the “findability” problem in the search, analytics, and content processing sector in my various writings and in my few and far between public speaking engagements. My 68 years weigh heavily on me when a 20-something pitches a talk in some place far from Harrod’s Creek, Kentucky.

The semantic difference between analytics and lead dust slurry is obvious to me. To the indexing methods in use at Baidu, Bing, Exalead, Google, Jike, and Yandex—not so much.

How big of a problem is this? You can see that Brainware, Sinequa, Thunderstone, and dozens of other content-centric outfits are conflated with questionable videos, electronic games, and Latin phrases. When looking for these companies and their brands via mobile devices, the findability challenge gets harder, not easier. The constant stream of traditional news releases, isolated blog posts, white papers which are much loved by graduate students in India, and Web collateral miss their intended audiences. I prefer “miss” to the blunt reality of “unread content.”

I am going to start a file in which to track brand confusion and company name erosion. Search, analytics, and content processing vendors should know that preserving the semantic “magnetism” of a word or phrase is important. It surprises me that I can run a query and get links to visual network analytics alongside high performance centrifuges. Some watching robots pay close attention to the “centrifuge” concept, I assume.

Brand management is important.

Stephen E Arnold, September 7, 2012

Sponsored by Augmentext

Twitter Politics

August 31, 2012

Oh, goody, more predictive silliness. TechNewsWorld informs us, “Twindex Tracks Pols’ Twitter Temperatures.” Clever name, though it does make me think more about window cleaning than about politics. That’s ok; window cleaning is the more engaging subject.

The full name of the metric is the Twitter Political Index, and it tracks tweeters’ daily thoughts about the two presidential candidates. Twitter created the index with the help of Topsy Labs and pollsters at the Mellman Group and North Star Opinion Research. The polling firms helped validate and tune the algorithms. It is Topsy’s job to track tweets for certain terms and compare sentiment on each candidate. So far, the incumbent seems to be well ahead in the Twittersphere.
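
The mechanics are easy to caricature. The sketch below is not Topsy’s method; it is a toy, lexicon-based comparison with invented word lists and tweets, included only to show the kind of arithmetic a sentiment index performs.

    # Toy sketch of the kind of comparison a sentiment index performs: score
    # tweets mentioning each candidate with a small lexicon and compare the
    # averages. Word lists and tweets are invented; Topsy's models are far
    # more elaborate.
    POSITIVE = {"great", "strong", "win", "hope", "good"}
    NEGATIVE = {"weak", "fail", "bad", "lose", "worst"}

    def score(tweet):
        words = tweet.lower().split()
        return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

    def candidate_index(tweets, candidate):
        relevant = [t for t in tweets if candidate.lower() in t.lower()]
        if not relevant:
            return 0.0
        return sum(score(t) for t in relevant) / len(relevant)

    sample_tweets = [
        "Obama gave a strong speech, great stuff",
        "Romney had a bad night, weak answers",
        "Hope Obama can win this one",
    ]
    for name in ("Obama", "Romney"):
        print(name, candidate_index(sample_tweets, name))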

But how far can we trust the Twindex? Probably not very far. Writer Richard Adhikari observes:

The Pew Research Center has found that only 15 percent of adults online use Twitter. On a typical day, that figure is only 8 percent. . . .

“Overall, nearly 30 percent of young adults use Twitter, up from 18 percent the previous year. One in five people aged 18 to 24 uses Twitter on a typical day.

“Further, 11 percent of adults aged 25 to 34 use Twitter on a typical day.

“African-Americans are also heavy Twitter users, with 28 percent of them using Twitter overall and 13 percent doing so on a typical day.

“Urban and suburban residents are also significantly more likely to use Twitter than those in rural areas, Pew found.”

So, yeah, statistically Democrats are likely to fare better among Twitter users than Republicans. This index is about as valuable as any political echo chamber—for entertainment only. Personally, I’d rather be washing windows.

Cynthia Murrell, August 31, 2012

Sponsored by ArnoldIT.com, developer of Augmentext

Document Management Is Ripe For eDiscovery

July 18, 2012

If you work in any capacity related to the legal community, you should be aware that eDiscovery generates a great deal of chatter. As with most search and information retrieval functions, progress is erratic.

While eDiscovery, according to the marketers who flock to Legal Tech and other conferences, will save clients and attorneys millions of dollars in the long run, there will still be costs associated with it. Fees do not magically disappear, and eDiscovery has its own costs that can accrue, even if they are a tad lower than the regular attorney’s time sheets.

One way to keep costs down is to create a document management policy; if you are ever taken to court, it will reduce the time and money spent in the litigation process. We have mixed feelings about document management. The systems are often problematic because the management guidance and support are inadequate. Software cannot “fix” this type of issue. Marketers, however, suggest software may be up to the task.

JD Supra discusses the importance of a document management plan in “eDiscovery and Document Management.” The legal firm of Warner, Norcross, and Judd wrote a basic strategy guide for JD Supra for people to get started on a document management plan. A plan’s importance is immeasurable:

“With proper document management, you’ll have control over your systems and records when a litigation hold is issued and the eDiscovery process begins, resulting in reduced risk and lower eDiscovery costs. This is imperative because discovery involving electronically stored data — including e-mail, voicemail, calendars, text messages and metadata — is among the most time-consuming and costly phases of any dispute. Ultimately, an effective document management policy is likely to contribute to the best possible outcome of litigation or an investigation.”

The best way to start working on a plan is to outline your purpose and scope: know what you need and want the plan to do. Also specify who will be responsible for each part of the plan; not designating proper authority can leave the entire plan in limbo. Never forget a records retention policy: most data must legally be kept for seven years or permanently, but some data can be deleted, and you should not pay to store data you do not have to keep. Most important of all, provide specific direction for individual tasks, such as scanning, word management, destruction schedules, and observing litigation holds. One last thing: never underestimate the importance of employee training and audit schedules; the latter will sneak up on you before you know it.
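
To make the retention idea concrete, here is a minimal Python sketch of a retention schedule with a litigation hold override. The categories and retention periods are invented for illustration; an actual schedule should come from counsel, not from a blog post.

    # Minimal sketch of a records retention schedule with a litigation hold
    # override. Categories and periods are invented for illustration only.
    from datetime import date, timedelta

    RETENTION_YEARS = {   # None means keep permanently
        "contracts": None,
        "email": 7,
        "drafts": 1,
    }

    def may_destroy(category, created, on_litigation_hold, today=None):
        """True only if the record is past retention and not on a hold."""
        if on_litigation_hold:
            return False          # a litigation hold trumps the schedule
        years = RETENTION_YEARS.get(category)
        if years is None:
            return False          # permanent records are never purged
        today = today or date.today()
        return created + timedelta(days=365 * years) < today

    print(may_destroy("email", date(2004, 1, 15), on_litigation_hold=False))  # True
    print(may_destroy("email", date(2004, 1, 15), on_litigation_hold=True))   # False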

If, however, you are still hesitant, be aware that not having a plan can carry some hefty consequences:

  • “Outdated and possibly harmful documents might be available and subject to discovery.
  • Failure to produce documents in a timely fashion might result in fines and jail time: one large corporation was charged with misleading regulators and not producing evidence in a timely matter and was fined $10 million.
  • Destroying documents in violation of federal statutes and regulations may result in fines and jail time: one provision of the Sarbanes-Oxley Act specifies a prison sentence of up to 20 years for someone who knowingly destroys documents with the intent to obstruct a government investigation.”

A document management plan is a tool meant to guide organizations in managing their data, outlining the tasks associated with it, and preparing for eventual audits and litigation procedures. Having a document management plan in place will make the eDiscovery process go more quickly, but another way to make the process even faster and more accurate is to use litigation support technology and predictive coding, such as that provided by Polyspot.

Here at Beyond Search we have a healthy skepticism for automated content processing. Some systems perform quite well in quite specific circumstances. Examples include Digital Reasoning and Ikanow. Other systems are disappointing. Very disappointing. Who are the disappointing vendors? Not in this free blog. Sign up for Honk!, our no holds barred newsletter, and get that opt-in, limited distribution information today.

Whitney Grace, July 18, 2012

Sponsored by Polyspot

Trimming Legal Costs and Jobs: A Predictive Coding Unintended Consequence?

July 17, 2012

Predictive coding and eDiscovery are circling the legal community’s gossip rings with questions about what they mean for the future of legal costs and jobs. The Huffington Post addresses the topic in “ ‘Lawyerbots’ Offer Attorneys Faster, Cheaper Assistants.” The US court system has adopted new rules governing eDiscovery technology and how it can be used in court cases. Lawyers, legal professionals, and even the companies licensing various programmatic content processing systems are struggling to understand the upside and downside of the algorithmic approach to coding. One way eDiscovery and predictive coding will be used is to cut down on the many, many hours spent processing electronic documents. This new technology is being referred to as “lawyerbots.”
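
For readers curious about what sits under the “lawyerbot” label, the sketch below shows the general technology assisted review pattern: attorneys label a small seed set of documents, a classifier learns from the labels, and the software ranks the unreviewed collection. It uses scikit-learn with invented documents and is a conceptual illustration, not any vendor’s actual system.

    # Sketch of the technology assisted review pattern: label a seed set,
    # train a classifier, and rank the unreviewed documents by predicted
    # responsiveness. Documents and labels are invented placeholders.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    seed_docs = [
        "email discussing the disputed supply contract terms",
        "memo on pricing changes for the contract at issue",
        "cafeteria menu for the week of March 5",
        "office party planning thread",
    ]
    seed_labels = [1, 1, 0, 0]  # 1 = responsive, 0 = not responsive

    vectorizer = TfidfVectorizer()
    model = LogisticRegression().fit(vectorizer.fit_transform(seed_docs), seed_labels)

    # Score the unreviewed collection; likely responsive documents surface first.
    unreviewed = [
        "draft amendment to the supply contract",
        "reminder to submit parking passes",
    ]
    scores = model.predict_proba(vectorizer.transform(unreviewed))[:, 1]
    for doc, prob in sorted(zip(unreviewed, scores), key=lambda pair: -pair[1]):
        print(round(float(prob), 2), doc)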

Lawyerbots cut through the man-hours like an electric knife, saving time and clients money. Many are optimistic about the changes. But some clients are ambivalent:

“But how will clients feel about a computer doing some of the dirty work, instead of a lawyer or paralegal manually digging through documents? Some could be concerned that a computer is more apt to make an error, or overlook crucial information. In a recent study in the Richmond Journal of Law and Technology, lawyer labor was tested against lawyerbots with predictive coding software. Researchers found “evidence that such technology-assisted processes, while indeed more efficient, can also yield results superior to those of exhaustive manual review.” In basic terms, the computers had the humans licked.”

Faster and more accurate! It is an awesome combination, but the next question to follow is what about jobs? There are several predictions already out there; the article mentions how Mike Lynch of Autonomy believes the legal community will employ fewer people in the future. Others are embracing the new technology pattern and plan to see changes as the older lawyers retire. Here’s one observation:

“Jonathan Askin, the director of Brooklyn Law School’s Brooklyn Law Incubator and Policy Clinic (BLIP)…said, ‘When I look around at my peers, I see 40-year-old lawyers who are still communicating via snail mail and fax machines and telephones and appearing in physical space for negotiations.’ He said he hopes to better merge the legal sector and technology to serve both lawyers and their clients more efficiently.”

We arrive at yet another crossroads: the traditional, variable cost approach vs. a new, allegedly more easily budgeted approach to content analysis.

As a librarian, I predict, without having to use predictive analytics, that eDiscovery will eliminate some legal jobs. Online wreaked havoc in the special library market. However, I am confident that there will still be a need for humans to keep the lawyerbots, and maybe the marketers of these systems, in check.

After all, software technology is only as smart as the humans who program it, and humans are prone to error. The lawyerbots will also drive down costs, a blessing in this poor economy, and more people will be apt to bring cases to court, increasing demand for lawyers. In order to get to this point, however, there needs to be an established set of standards covering how litigation support software can be programmed, how it can be used, and the basic requirements for the processes and code. What’s the outlook? Uncertainty, and probably one step forward and one step backward.

Whitney Grace, July 17, 2012

Sponsored by Polyspot

Google and Latent Semantic Indexing: The KnowledgeGraph Play

June 26, 2012

One thing that is always constant is Google changing itself. Not too long ago Google introduced yet another new tool: Knowledge Graph. Business2Community spoke highly about how this new application proves the concept of latent semantic indexing in “Keyword Density is Dead…Enter ‘Thing Density.’” Google’s claim to fame is providing the most relevant search results based on a user’s keywords. Every time Google updates its algorithm, it is to keep relevancy up. The new Knowledge Graph allows users to break down their search by clustering related Web sites and finding the latent semantic relationships among the results. From there the search conducts a secondary search, and so on. Google does this to reflect the natural use of human language; that is, to make its products user friendly.

But this change raises an important question:

“What does it mean for me!? Well first and foremost keyword density is dead, I like to consider the new term to be “Concept Density” or to coin Google’s title to this new development “Thing Density.” Which thankfully my High School English teachers would be happy about. They always told us to not use the same term over and over again but to switch it up throughout our papers. Which is a natural and proper style of writing, and we now know this is how Google is approaching it as well.”

The change means good content and SEO will be rewarded. This does not change the fact, of course, that Google will probably change its algorithm again in a couple of months, but now the company is recognizing that LSI has value. Most vendors that provide latent semantic indexing, content, and text analytics, such as Content Analyst, have gone well beyond Google’s offering, using the latest LSI methods to make data more findable and to discover new correlations.
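
For the curious, here is a minimal latent semantic indexing sketch using scikit-learn. It projects a toy corpus into a couple of latent “concepts” so that documents about the same thing score as similar even when the exact keywords differ; the corpus is invented, and the example says nothing about Google’s actual implementation.

    # Minimal latent semantic indexing sketch: project documents into a few
    # latent "concepts" so texts about the same thing score as similar even
    # when the exact keywords differ. The toy corpus is invented.
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "search engine relevancy and ranking",
        "ranking results for a query engine",
        "centrifuge coolant sludge and pump maintenance",
        "pump and valve maintenance schedules",
    ]

    tfidf = TfidfVectorizer().fit_transform(docs)
    concepts = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

    # Row 0 of the similarity matrix: document 0 against every document.
    print(cosine_similarity(concepts)[0].round(2))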

Whitney Grace, June 26, 2012

Sponsored by Content Analyst

The Alleged Received Wisdom about Predictive Coding

June 19, 2012

Let’s start off with a recommendation. Snag a copy of the Wall Street Journal and read the hard copy front page story in the Marketplace section, “Computers Carry Water of Pretrial Legal Work.” In theory, you can read the story online if you don’t have Sections A-1, A-10 of the June 18, 2012, newspaper. A variant of the story appears as “Why Hire a Lawyer? Computers Are Cheaper.”

Now let me offer a possibly shocking observation: The costs of litigation are not going down for certain legal matters. Neither bargain basement human attorneys nor Fancy Dan content processing systems make the legal bills smaller. Your mileage may vary, but for those snared in some legal traffic jams, costs are tough to control. In fact, search and content processing can impact costs, just not in the way some of the licensees of next generation systems expect. That is one of the mysteries of online that few can penetrate.

The main idea of the Wall Street Journal story is that “predictive coding” can do work that human lawyers do for a higher cost but sometimes with much less precision. That’s the hint about costs in my opinion. But the article is traditional journalistic gold. Coming from the Murdoch organization, what did I expect? i2 Group has been chugging along with relationship maps for case analyses of important matters since 1990. Big alert: i2 Ltd. was a client of mine. Let’s see: that was more than a couple of weeks ago that basic discovery functions were available.

The write up quotes published analyses which indicate that when humans review documents, those humans get tired and do a lousy job. The article cites “experts” from Thomson Reuters, a firm steeped in legal and digital expertise, who point out that predictive coding is going to be an even bigger business. Here’s the passage I underlined: “Greg McPolin, an executive at the legal outsourcing firm Pangea3 which is owned by Thomson Reuters Corp., says about one third of the company’s clients are considering using predictive coding in their matters.” This factoid is likely to spawn a swarm of azure chip consultants who will explain how big the market for predictive coding will be. Good news for the firms engaged in this content processing activity.

Which grows faster: the costs of a legal matter or the costs of a legal matter that requires automation and trained attorneys? Why do companies embrace automation plus human attorneys? Risk certainly is a turbocharger.

The article also explains how predictive coding works, offers some cost estimates for various actions related to a document, and adds some cautionary points about predictive coding proving itself in court. In short, we have a touchstone document about this niche in search and content processing.

My thoughts about predictive coding are related to the broader trends in the use of systems and methods to figure out what is in a corpus and what a document is about.

The driver for most content processing is related to two quite human needs. First, the costs of coping with large volumes of information are high and going up fast. Second, there is the need to reduce risk. Most professionals find quips about orange jump suits, sharing a cell with Mr. Madoff, and the iconic “perp walk” downright depressing. When a legal matter surfaces, the need to know what’s in a collection of content like corporate email is high. The need for speed is driven by executive urgency. The cost factor kicks in when the chief financial officer has to figure out the costs of determining what’s in those documents. Predictive coding to the rescue. One firm used the phrase “rocket docket” to communicate speed. Other firms promise optimized statistical routines. The big idea is that automation is fast and cheaper than having lots of attorneys sifting through documents in printed or digital form. The Wall Street Journal is right. Automated content processing is going to be a big business. I just hit the two key drivers. Why dance around what is fueling this sector?

Read more

Protected: Pitney Bowes and iDiscovery Tack on Analytics

June 15, 2012


Microsoft SharePoint: Controlled Term Functionality

June 13, 2012

“SharePoint Search, Synonyms, Thesaurus, and You” provides a useful summary of Microsoft SharePoint’s native support for controlled term lists. Today, the buzzwords taxonomy and ontology are used to refer to term lists which SharePoint can use to index content. Term lists may consist of company-specific vocabulary, the names of people and companies with which a firm does business, or formal lists of words and phrases with “Use for” and “See also” cross references.

The importance of a controlled term list is often lost when today’s automated indexing systems process content. Almost any search system benefits when the content processing subsystem can use a controlled term list as well as the automated methods baked into the indexer.

In this TechGrowingPains write up, the author says:

A little known, and interesting, feature in SharePoint search is the ability to create customized thesaurus word sets. The word sets can either be synonyms, or word replacements, augmenting search functionality. This ability is not limited to single words, it can also be extend into specific phrases.

The article explains how controlled term lists can be used to assist a user in formulating a query. The method is called “replacement words”. The idea of suggesting terms is a good one which many users find a time saver when doing research. The synonym expansion function is mentioned as well. SharePoint can insert broader terms into a user’s query which increases or decreases the size of the result set.

The centerpiece of the article is a recipe for activating this functionality. A helpful code snippet is included as well.
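
To show the difference between the two behaviors in plain terms, here is a small Python sketch of replacement words versus synonym expansion. The term lists are invented, and the snippet is a conceptual illustration, not SharePoint’s actual thesaurus configuration syntax.

    # Plain illustration of the two thesaurus behaviors: "replacement words"
    # rewrite a query term outright, while synonym expansion adds terms and
    # broadens the result set. Invented term lists; this is not SharePoint's
    # actual thesaurus configuration syntax.
    REPLACEMENTS = {"hr": "human resources"}         # pattern -> substitute
    EXPANSIONS = {"invoice": ["bill", "statement"]}  # term -> extra synonyms

    def rewrite_query(query):
        terms = []
        for term in query.lower().split():
            term = REPLACEMENTS.get(term, term)      # replacement: swap the term
            terms.append(term)
            terms.extend(EXPANSIONS.get(term, []))   # expansion: add synonyms
        return " OR ".join(terms)

    print(rewrite_query("HR invoice"))
    # human resources OR invoice OR bill OR statement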

If you want additional technical support, let us know. Our Search Technologies team has deep experience in Microsoft SharePoint search and customization. We can implement advanced controlled term features in almost any SharePoint system.

Iain Fletcher, June 13, 2012

Autonomy Offers Automatic Classification and Taxonomy Generation

May 7, 2012

Conceptualizing the processes and methods behind the storage and organization of data in our current age ruled by unstructured content and meta-tags can prove overwhelming. We found a great source of information from Autonomy, which explains their offering of Automatic Classification and Taxonomy Generation.

With their eye on functionality, IDOL’s classification solutions help users to circumvent issues that have arisen in a time of exponential data growth.

In addition to Taxonomy Libraries and Automatic Categorization and Channels, the Autonomy Collaborative Classifier is included. Their website clearly delineates how these elements work.

The website states the following information regarding Taxonomy Libraries:

“Built by experienced knowledge engineers using best practices learned through hundreds of consulting engagements, Autonomy taxonomies let organizations rapidly deploy industry-standard taxonomies that can be combined with your corporate taxonomies or easily customized to meet company and industry-specific requirements. Each Autonomy taxonomy is based on industry standards, and built using IDOL’s conceptual analysis that provides the highest level of accuracy.”

IDOL includes a variety of taxonomies, ranging from biotechnology to financial services: a comprehensive solution, indeed. Overall, IDOL seems equipped to eliminate the time consuming manual intervention required in the past. But open source alternatives exist and should be considered by procurement teams.
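
For readers who want a feel for what automatic categorization against a taxonomy involves, here is a bare-bones Python sketch that scores a document against each node’s term list. The taxonomy and scoring are invented placeholders; IDOL’s conceptual analysis is proprietary and far more sophisticated.

    # Bare-bones automatic categorization against a taxonomy: score a
    # document against each node's term list and assign the best match.
    # Taxonomy and terms are invented; IDOL's conceptual analysis is a
    # proprietary and far more sophisticated method.
    TAXONOMY = {
        "Biotechnology": {"genome", "protein", "clinical", "assay"},
        "Financial Services": {"equity", "portfolio", "loan", "hedge"},
    }

    def categorize(text):
        words = set(text.lower().split())
        scores = {node: len(words & terms) for node, terms in TAXONOMY.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else "Uncategorized"

    print(categorize("The clinical assay measured protein expression levels"))
    # Biotechnology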

Megan Feil, May 9, 2012

Sponsored by Ikanow
