Are Google and Free an Oxymoron?

September 3, 2013

I am working on my presentation for the upcoming ISS intelligence conference. One of the topics I will be addressing is “What is possible and not possible with Google’s index.”

Now don’t get the addled goose wrong. The goslings and I use Google and a number of other online services each day. The reason is that online indexing remains a hit-and-miss proposition. Today’s search gurus ignore the problem of content which is unindexable, servers which are too slow and time out, or latency issues which consign data to the big bit bucket in the back of the building. In addition, few talk about content which is intentionally deleted or moved to a storage device beyond the reach of a content acquisition system. Then there are the all-too-frequent human errors which blast content into oblivion because backup devices cannot restore data. Clever programmers change a file format. The filters and connectors designed to index the content do not recognize the file type and either put the document in the “look at this, dear human” folder or skip the file type altogether. And there are other issues, ranging from bandwidth constraints to time-out settings to software that simply does not work.
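To make the filter-and-connector failure concrete, here is a minimal Python sketch of the kind of dispatch logic an indexing pipeline might use. The file extensions, the review folder name, and the index_document stub are hypothetical illustrations, not code from any particular vendor.

```python
import shutil
from pathlib import Path

# Hypothetical set of formats this pipeline's filters understand.
KNOWN_TYPES = {".txt", ".html", ".pdf", ".docx"}
REVIEW_DIR = Path("review_queue")  # the "look at this, dear human" folder

def index_document(path: Path) -> None:
    """Stub for the real indexing call; a production system would parse
    the file and push its terms into the index here."""
    print(f"indexed {path.name}")

def process(path: Path) -> None:
    if path.suffix.lower() not in KNOWN_TYPES:
        # Unrecognized format: route the file to a human instead of the
        # index, so the document silently drops out of search results.
        REVIEW_DIR.mkdir(exist_ok=True)
        shutil.copy(path, REVIEW_DIR / path.name)
        return
    try:
        index_document(path)
    except TimeoutError:
        # Slow servers and latency send content to the big bit bucket.
        print(f"skipped {path.name}: timed out")
```

Every branch that never reaches index_document is a document a user will never find.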

Does Google face a tough climb if online advertising falters? Image source: http://www.nps.gov/media/photo/gallery.htm?tagID=8362

Are these issues discussed? Not often. And when a person like the addled goose brings up these issues, the whiz kids just sigh and tell me that my dinosaur tail is going to knock over the cappuccino machine in the conference room. No problem. Most search vendors are struggling to make sales, control costs, and keep their often-flawed systems running well enough so the licensee pays the invoices. Is this a recipe for excellence? Not in my old-fashioned notebook.

I read “The Internet’s Next Victim: Advertising.” I found the title interesting because I thought the Internet’s next victim was a manager who used online search results without verifying the information. The article caught my attention because if it is accurate, Google is going to be forced to make some changes. The line “Everyone agrees that advertising on the Internet is broken” is one of those sweeping generalizations I find amusing. For some folks, online advertising works reasonably well. When one considers the options advertisers have, the Internet looks like a reasonable tool for certain products and services.

Evidence of this is Google’s ability to fund everything from tryst jets to self-driving automobiles. Google has, if I understand the financial reports, managed to generate about 95 percent of its revenue from online advertising. The job-hunting Steve Ballmer pointed out that Google was a one-trick pony. Well, he might have been wrong about my love of Windows 8, but he was spot on about Google’s inability to generate products and services beyond advertising.

That’s why the “Next Victim” article is thought provoking. What if Salon is correct? What will Google do to generate more revenue if advertising money decreases? What will Google do if the cost of selling ads spikes by 15 percent or more?

The options for Google are plentiful; for example:

  • Raise ad rates
  • Take ads from advertisers who are now not permitted to use the Google system
  • Reduce staff, benefits, or salaries
  • Cut back on some of the investments which are essentially expensive, science fiction, Bell Labs-type projects
  • Ramp up fees to customers

There are other options, of course. But the easiest path to follow is to increase the number of sponsored messages and ads shown to users of Google’s most popular services. Mobile advertising is tricky because the screen is small and the graphic approach on tablets makes the clutter of the old-style desktop display look like a 1959 Cadillac tail fin.

What happens when ads take precedence over relevant, objective results? The usefulness of the search system decreases. The good news is that most users of online search systems are happy to get some information. These users believe that the information on a search results page is accurate. Who needs for-fee research systems? The free results are good enough. The downside is that, for the subject matter expert, the results from most free online search systems are flawed. For many of today’s professionals, this is a small price to pay for convenience. Who has time to verify search results?

Net net: if the “Next Victim” article is correct, Google may find itself facing an uphill climb. Looking at the data through Glass won’t change the outlook, however.

In my ISS talk, I will be offering several concrete suggestions to those who want to verify online results displayed in response to a predictive, personalized query.

Stephen E Arnold, September 3, 2013

Sponsored by Xenky

MaxxCAT Offers SQL Connector

August 30, 2013

Specialized hardware vendor MaxxCAT offers a SQL connector, allowing its appliances to directly access SQL databases. We read about the tool, named BobCAT, at the company’s Search Connect page. We would like to note that the company’s Web site now makes it easier to locate its expanding range of appliances for search and storage.

Naturally, BobCAT can be configured for use with Microsoft SQL Server, Oracle, and MySQL, among other ODBC databases. The connector’s integration with MaxxCAT’s appliances makes it easier to establish crawls and customize output using formats like JSON, HTML, and SQL. The write-up emphasizes:

“The results returned from the BobCAT connector can be integrated into web pages, applications, or other systems that use the search appliance as a compute server performing the specialized function of high performance search across large data sets.

“In addition to indexing raw data, The BobCAT connector provides the capability for raw integrators to index business intelligence and back office systems from disparate applications, and can grant the enterprise user a single portal of access to data coming from customer management, ERP or proprietary systems.”
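For readers who want a feel for what an ODBC-based connector does behind the curtain, here is a rough Python sketch that pulls rows over ODBC and emits JSON records for an indexer. The DSN, table, and field names are invented for illustration; this is generic pyodbc code, not MaxxCAT’s BobCAT interface, which the write-up does not document at the code level.

```python
import json
import pyodbc

# Hypothetical ODBC connection string; a real BobCAT crawl is configured
# on the appliance, so this only illustrates the generic pattern.
CONN_STR = "DSN=crm_db;UID=reader;PWD=secret"

def crawl(query: str):
    """Yield one JSON document per row, ready to hand to an indexer."""
    with pyodbc.connect(CONN_STR) as conn:
        cursor = conn.cursor()
        cursor.execute(query)
        columns = [col[0] for col in cursor.description]
        for row in cursor.fetchall():
            yield json.dumps(dict(zip(columns, row)), default=str)

if __name__ == "__main__":
    for doc in crawl("SELECT id, title, body FROM support_tickets"):
        print(doc)  # a real deployment would post this to the search index
```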

MaxxCAT does not stop with its SQL connector. Its Lynx Connector gives developers, integrators, and connector foundries a way to hook into its enterprise search appliances. The same Search Connect page explains:

“The connector consists of two components, the input bytestream and a subset of the MaxxCAT API that controls the processing of collections and the appliance.

“There are many applications of the Lynx Connector, including building plugins and connector modules that connect MaxxCAT to external software systems, document formats and proprietary cloud or application infrastructure. Users of the Lynx Connector have a straightforward path to take advantage of MaxxCAT’s specialized and high performance retrieval engine in building solutions.”

Developers interested in building around the Lynx framework are asked to email the company for more information, including a line on development hardware and support resources. MaxxCAT was founded in 2007 to capitalize on the high-performance, specialized hardware corner of the enterprise search market. The company manages to offer competitive pricing without sacrificing its focus on performance, simplicity, and ease of integration. We continue to applaud MaxxCAT’s recently launched program for nonprofits.

Cynthia Murrell, August 30, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

How Redundancy Can be a Competitive Advantage in eCommerce

August 28, 2013

The recent SLI Systems article, “In eCommerce Be, really, really redundant,” makes the argument that, unlike in most situations, redundancy in cloud computing can be quite beneficial. This is because it prevents downtime, a known cause of inefficiency. Redundancy, therefore, is actually a competitive advantage.
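The mechanics behind the claim are easy to sketch. A client that fails over across redundant endpoints keeps serving shoppers when one node goes dark. The endpoint URLs and retry policy below are assumptions for illustration, not anything SLI Systems describes.

```python
import requests

# Hypothetical redundant replicas of the same search or commerce service.
ENDPOINTS = [
    "https://search-a.example.com/query",
    "https://search-b.example.com/query",
    "https://search-c.example.com/query",
]

def query_with_failover(params: dict, timeout: float = 2.0) -> dict:
    """Try each replica in turn; fail only if every one is down."""
    last_error = None
    for url in ENDPOINTS:
        try:
            resp = requests.get(url, params=params, timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as err:
            last_error = err  # replica unreachable or unhealthy; try the next
    raise RuntimeError(f"all replicas failed: {last_error}")
```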

The article explains:

“Downtime is especially detrimental in eCommerce; online buyers can be ruthless when they encounter it. Surveys by Akamai and Gomez.com show that among shoppers who have trouble with a web site’s performance, 79% will never return to buy from that site again. Plus, 44% say they would tell a friend about their poor experience. Even a few minutes of downtime can result in dozens of lost customers on an ordinary day. Imagine the effect of downtime during a peak shopping day like Cyber Monday!”

The article goes on to explain other situations where redundancy has been used to prevent both natural and technological disasters. While redundancy may be a plus for eCommerce businesses, how will it impact Google’s indexing?

Jasmine Ashton, August 28, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Oracle Focuses On New Full Text Query

August 26, 2013

Despite enterprise companies moving away from SQL databases to more robust NoSQL options, Oracle has updated its database to include new features, including an XQuery Full Text search. We found an article that examines how the new function will affect Oracle and where it seems to point. The article from the Amis Technology Blog, “Oracle Database 12c: XQuery Full Text,” explains that XQuery Full Text search was made to handle unstructured XML content. It does so by extending the XQuery language in Oracle XML DB. This finally makes Oracle capable of working with all types of XML. The rest of the article focuses on the XQuery code.

When the new feature was tested on Wikipedia content stored as XML, the results were positive:

“During tests it proved very fast on English Wikipedia content (10++ Gb) and delivered the results within less than a second. But such a statement will only be picked up very efficiently if the new, introduced in 12c, corresponding Oracle XQuery Full-Text Index has been created.”
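For the curious, a query against such content might look roughly like the Python sketch below, which uses cx_Oracle and the SQL/XML XMLExists function with the W3C “contains text” predicate. The connection details, table, and column are invented, and the exact full-text syntax and index setup should be checked against Oracle’s 12c documentation rather than taken from this sketch.

```python
import cx_Oracle

# Hypothetical connection details and schema; adjust for a real 12c instance.
conn = cx_Oracle.connect("scott", "tiger", "dbhost/orcl12c")

# The WIKI_PAGES table and its XMLType column DOC are assumptions.
# "contains text" is the W3C XQuery Full Text predicate exposed in 12c.
QUERY = """
SELECT w.id
FROM   wiki_pages w
WHERE  XMLExists(
         '$d//page[.//text() contains text "search engine"]'
         PASSING w.doc AS "d")
"""

cursor = conn.cursor()
cursor.execute(QUERY)
for (page_id,) in cursor:
    print(page_id)
```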

Oracle is trying to improve its technology as more of its users switch over to NoSQL databases. Improving the search function as well as other features keeps Oracle in the competition and proves that relational tables still have some kick in them. Interestingly enough, Oracle appears to be focusing its energies on MarkLogic’s technology to keep in the race.

Whitney Grace, August 26, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Next Generation Content Processing: Tail Fins and Big Data

August 19, 2013

[Note: I wrote this for Homeland Security Today. It will appear when the site works out its production problems. As background, check out “The Defense Department Thinks Troves of Personal Data Pose a National Security Threat.” If the Big Data systems worked as marketers said, would these success stories not provide ample evidence of the value of these next generation systems?]

Next-generation content processing seems, like wine, to improve with age. Over the last four years, smart software has been enhanced by design. What is your impression of the eye-popping interfaces from high-profile vendors like Algilex, Cybertap, Digital Reasoning, IBM i2, Palantir, Recorded Future, and similar firms? (A useful list is available from Carahsoft at http://goo.gl/v853TK.)

For me, I am reminded of the design trends for tail fins and chrome for US automobiles in the 1950s and 1960s. Technology advances in these two decades moved forward, but soaring fins and chrome bright work advanced more quickly. The basics of the automobile remained unchanged. Even today’s most advanced models perform the same functions as the Kings of Chrome of an earlier era. Eye candy has been enhanced with creature comforts. But the basics of today’s automobile would be recognized and easily used by a driver from Chubby Checker’s era. The refrain “Let’s twist again like we did last summer” applies to most of the advanced software used by law enforcement and the intelligence community.

[Image file: tailfin.png]


The tailfin of a 1959 Cadillac. Although bold, the tailfins of the 1959 Plymouth Fury and the limited production Superbird and Dodge Daytona dwarfed GM’s excesses. Source: https://en.wikipedia.org/wiki/File:Cadillac1001.jpg

Try this simple test. Here are screenshots from five next-generation content processing systems. Can you match the graphics with the vendor?

Here are the companies whose visual outputs appear below. Easy enough, just like one of those primary school exercises: simply match the interface with the company.

The vendors represented are:

A Digital Reasoning (founded in 2000 and funded in part by SilverLake; the company positions itself as providing automated understanding, as did Autonomy, founded in 1996)

B IBM i2 (an industry leader since the mid-1990s)

C Palantir (founded a decade ago with $300 million in funding from Founders Fund, Glynn Capital Management, and others)

D Quid (a start-up funded in part by Atomico, SV Angel, and others)

E Recorded Future (funded in part by In-Q-Tel and Google, founded by the developer of Spotfire)


Another Information Priority: Legacy Systems

August 16, 2013

The hoohah about cloud computing, Big Data, and other “innovations” continues. Who needs Oracle when one has Hadoop? Why license SPSS or some other Fancy Dan analytics system when there are open source analytics systems a mouse click away? Search? Lots of open source choices.


Image from http://sageamericanhistory.net/gildedage/topics/gildedage3.html

We have entered the Gilded Age of information and data analysis. Do I have that right?

The marketers and young MBAs chasing venture funding instead of building revenue shout, “Yes, break out the top hats and cigars. We are riding a hockey stick type curve.”

Well, sort of. I read “Business Intelligence, Tackling Legacy Systems Top Priorities for CIOs.” Behind the consultant speak and fluff, there lurk two main points:

  1. Professionals in the US government, and I presume elsewhere, are struggling to make sense of “legacy” data; that is, information stuffed in file cabinets or sitting in an antiquated system down the hall
  2. The problems information technology managers face remain unresolved. After decades of effort by whiz kids, few organizations can provide basic information technology services.

As one Reddit thread made clear, most information technology professionals use Google to find a fix or read the manual. See Reddit and search for “secrets about work business”.

A useful comment about the inability to tap data appears in “Improving business intelligence and analytics the top tech priority, say Government CIOs.” Here’s the statement:

IT contracts expert Iain Monaghan of Pinsent Masons added: “Most suppliers want to sell new technology because this is likely to be where most of their profit will come from in future. However, they will have heavily invested in older technology and it will usually be cheaper for them to supply services using those products. Buyers need to balance the cost they are prepared to pay for IT with the benefits that new technology can deliver,” he said. “Suppliers are less resistant to renegotiating existing contracts if buyers can show that there is a reason for change and that the change offers a new business opportunity to the supplier. This is why constant engagement with suppliers is important. The contract is meant to embody a relationship with the supplier.”

Let me step back, way back. Last year my team and I prepared a report to tackle this question, “Why is there little or no progress in information access and content processing?”

We waded through the consultant chopped liver, the marketing baloney, and the mindless prose of thought leaders. Our finding was really simple. In fact, it was so basic we were uncertain about a way to present it without coming across like a stand up comedian at the Laugh House. To wit:

Computational capabilities are improving, but the volume of content to be processed is growing rapidly. Software which could once cope with basic indexing and statistical chores now bottlenecks in widely used systems. As a result, the gap between what infrastructure and software can process and the amount of data to be imported, normalized, analyzed, and output is growing. Despite recent advances, most organizations are unable to keep pace with new content and changes to current content. Legacy content is in most cases not processed. Costs, time, and tools seem to be an intractable problem.

Flash forward to the problem of legacy information. Why not “sample” the data and use that? Sounds good. The trouble is that even sampling is fraught with pitfalls. Most introductory statistics courses explain the dangers of flawed sampling.
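A toy example makes the pitfall concrete. If the “sample” of legacy records is a convenience sample, say, whatever happens to sit in the newest file share, the estimate it produces can be badly skewed. The numbers below are invented purely for illustration.

```python
import random

random.seed(42)

# Invented legacy archive: 90 percent old records with a low error rate,
# 10 percent recent records with a high error rate after a format change.
old = [{"age": "old", "bad": random.random() < 0.05} for _ in range(9000)]
new = [{"age": "new", "bad": random.random() < 0.40} for _ in range(1000)]
archive = old + new

def bad_rate(records):
    return sum(r["bad"] for r in records) / len(records)

# Convenience sample: only the easy-to-reach recent records.
convenience = [r for r in archive if r["age"] == "new"][:500]

# Simple random sample drawn across the whole archive.
srs = random.sample(archive, 500)

print(f"true bad-record rate:        {bad_rate(archive):.2%}")
print(f"convenience-sample estimate: {bad_rate(convenience):.2%}")
print(f"random-sample estimate:      {bad_rate(srs):.2%}")
```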

How prevalent is the use of flawed sampling? Some interesting examples from “everywhere” appear on the American Association for Public Opinion Research Web site. For me, I just need to reflect on the meetings in which I have participated in the last week or two. Examples:

  1. Zero revenue because no one matched the “product” to what the prospects wanted to buy
  2. Bad hires because no one double checked references. The excuse was, “Too busy” and “the system was down.”
  3. Client did not pay because “contracts person could not find a key document.”

Legacy data? Another problem of flawed business and technology practices. Will azure chip consultants and “motivated” MBAs solve the problem? Nah. Will flashy smart software be licensed and deployed? Absolutely. Will the list of challenges be narrowed in 2014? Good question.

Stephen E Arnold, August 16, 2013

Sponsored by Xenky

Basho Releases Another Riak

August 16, 2013

Without further ado from Basho.com, “Basho Announces Availability Of Riak 1.4” covers the latest release of the popular NoSQL database. Technology news Web sites have been buzzing about the new Riak upgrade and what it will offer its users. According to the article, version 1.4 offers more functionality, resolves issues, and adds functions requested by its users. It also gives a small taste of what to expect in version 2.0, which will be available for download later in 2013.

Here is what the upgrade features:

· Secondary Indexing Improvements: Query results are now sorted and paginated, offering developers much richer semantics

· Introducing Counters in Riak: Counters, Riak’s first distributed data type, provide automatic conflict resolution after a network partition

· Simplified Cluster Management With Riak Control: New capabilities in Riak’s GUI-based administration tool improve the cluster management page for preparing and applying changes to the cluster

· Reduced Object Storage Overhead: Values and associated metadata are stored and transmitted using a more compact format, reducing disk and network overhead

· Handoff Progress Reporting: Makes operating the cluster, identifying and troubleshooting issues, and monitoring the cluster simpler

· Improved Backpressure: Riak responds with an overload message if a vnode has too many messages in queue

Users will be happy with how Riak 1.4 provides better functionality and management for clusters and datacenters. The updates and the 2.0 preview are enough to be excited about. There does not seem to be a thing NoSQL databases cannot do.
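For readers who want to kick the tires, here is a rough sketch of how the counters and the paginated secondary-index queries can be exercised over Riak’s HTTP interface. It assumes a local node on the default HTTP port, and the endpoints reflect our reading of the 1.4 release notes, so verify against Basho’s documentation before relying on them.

```python
import requests

BASE = "http://127.0.0.1:8098"  # assumed local node on Riak's default HTTP port

# Increment a distributed counter (new in 1.4) by posting the increment amount.
requests.post(f"{BASE}/buckets/hits/counters/homepage", data="1")
value = requests.get(f"{BASE}/buckets/hits/counters/homepage").text
print("homepage hits:", value)

# Paginated secondary-index query; sorted results and pagination are 1.4 features.
resp = requests.get(
    f"{BASE}/buckets/users/index/last_name_bin/smith",
    params={"max_results": 5},
).json()
print("keys:", resp.get("keys"))
print("continuation:", resp.get("continuation"))
```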

Whitney Grace, August 16, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Search and Null: Not Good News for Some

August 3, 2013

I read “How Can I Pass the String ‘Null’ through WSDL (SOAP)…” My hunch is that only a handful of folks will dig into this issue. Most senior managers buy the baloney generated by search and content processing vendors. Yesterday, for one of the outfits publishing my “real” (for-fee) columns, I reviewed a slide deck stuffed full of “all’s” and “every’s.” The message was that this particular modern system, which boasted a hefty price tag, could do just about anything one wanted with flows of content.

Happily overlooked was the problem of a person with a wonky name. Case in point: “Null”. The link from Hacker News to the Stackoverflow item gathered a couple of hundred comments. You can find these here. If you are involved in one of the next-generation, super-wonderful content processing systems, you may find a few minutes with the comments interesting and possibly helpful.

My scan of the comments plus the code in the “How Can I” post underscored the disconnect between what people believe a system can do and what a here-and-now system can actually do. Marketers say one thing, buyers believe another, and the installed software does something completely different.

Examples:

  1. A person’s name—in this case “Null”—cannot be located in a search system. (A small illustration appears after this list.) With all the hoo-hah about Fancy Dan systems, is this issue with a named entity important? I think it is, because it means that certain entities may not be findable without expensive, time-consuming human curation and indexing. Oh, oh.
  2. Non-English names pose additional problems. Migrating a name in one language into a string that a native speaker of a different language can understand introduces some problems. Instead of finding one person, the system finds multiple people. Looking for a batch of 50 people, each incorrectly identified during processing, generates a lot of names, which guarantees more work for expensive humans or many, many false drops. Operate this type of entity extraction system a number of times and one generates so much work that there is not enough money or people to figure out what’s what. Oh, oh.
  3. Validating named entities requires considerable work. Knowledgebases today are built automatically and on the fly. Rules are no longer created by humans. Systems, like some of Google’s “janitor” technology, figure out the rules themselves, and then “workers” modify those rules on the fly. So what happens when errors are introduced via “rules”? The system keeps on truckin’. Anyone who has worked through fixing up the known tags from a smart system like Autonomy IDOL knows that degradation can set in when the training set does not represent the actual content flow. Any wonder why precision and recall scores have not improved much in the last 20 years? Oh, oh.
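A few lines of code show how a person whose surname is literally “Null” can vanish. If any layer treats the string “null” as a missing value, the entity drops out before it ever reaches the index. The field names below are hypothetical.

```python
import json

records = [
    {"first": "Ada", "last": "Lovelace"},
    {"first": "Steve", "last": "Null"},  # a real person with an unlucky surname
]

def naive_clean(record: dict) -> dict:
    """A common but wrong shortcut: treat the *string* 'null' as a missing value."""
    return {k: (None if str(v).lower() == "null" else v) for k, v in record.items()}

cleaned = [naive_clean(r) for r in records]
print(json.dumps(cleaned, indent=2))  # the surname is now a JSON null

# An index keyed on "last" silently drops the record: a false drop in miniature.
indexable = [r for r in cleaned if r["last"]]
print(len(indexable), "of", len(records), "records reach the index")
```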

I think this item about “Null” highlights the very real and important problems with assumptions about automated content processing, whether the corpus is a telephone directory with a handful of names or the mind-boggling flows which stream from various content channels.

Buying a system does not solve long-standing, complicated problems in text processing. Fast talk like that which appears in some of the Search Wizards Speak interviews does not change the false drop problem.

So what does this mean for vendors of Fancy Dan systems? Ignorance on the part of buyers is one reason why deals may close. What does this mean for users of systems which generate false drops and dependent reports which are off base? Ignorance on the part of users makes it easy to use “good enough” information to make important decisions.

Interesting, Null?

Stephen E Arnold, August 3, 2013

Sponsored by Xenky

Autonomy ArcSight Tackles Security

August 2, 2013

HP Autonomy is chasing the Oracle SES angle: security for search. We took a look at the company’s pages about HAVEn, Autonomy’s latest big data platform. Regarding the security feature, ArcSight Logger, the company promises:

“With HP ArcSight Logger you can improve everything from compliance and risk management to security intelligence to IT operations to efforts that prevent insider and advanced persistent threats. This universal log management solution collects machine data from any log-generating source and unifies the data for searching, indexing, reporting, analysis, and retention. And in the age of BYOD and mobility, it enables you to comprehensively manage an increasing volume of log data from an increasing number of sources.”

More information on HAVEn can be found in the YouTube video, “Brian Weiss Talks HAVEn: Inside Track with HP Autonomy.” At the 1:34 mark, Autonomy VP Weiss briefly describes how ArcSight analyzes the data itself, from not only inside but also outside an enterprise, for security clues. For example, a threatening post in social media might indicate a potential cyber-attack. It is an interesting approach. Can HP make this a high revenue angle?
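This is not ArcSight code, but a crude Python sketch of the idea Weiss describes: watch events from inside and outside the enterprise and flag the ones that look threatening. The keyword list and event format are invented for illustration.

```python
import re

# Crude, invented threat vocabulary; a real system uses far richer analytics.
THREAT_TERMS = re.compile(r"\b(ddos|breach|take down|leak)\b", re.IGNORECASE)

# Sample events: one internal log line, one external social media post.
events = [
    {"source": "firewall", "text": "denied inbound 10.0.0.7 -> 203.0.113.5:22"},
    {"source": "social", "text": "We will take down example.com this Friday."},
]

def flag(event: dict) -> bool:
    """Flag events whose text matches the threat vocabulary."""
    return bool(THREAT_TERMS.search(event["text"]))

for event in events:
    if flag(event):
        print(f"ALERT [{event['source']}]: {event['text']}")
```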

Cynthia Murrell, August 02, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

Acquisition of Gigablast by Yippy Leaves Some Questions Unanswered

July 19, 2013

An article on Yahoo titled “Yippy, Inc. (YIPI) to Acquire Gigablast, Inc. And Web Research Properties, LLC to Expand Consumer Search, Enterprise, and eDiscovery Products” reported on the important acquisition by the young company. Yippy, Inc. is a search clustering technology company based in Florida with some innovative eDiscovery resources. Matt Wells, the founder of Gigablast, states in the article:

“Gigablast and its related properties can provide advanced technologies for consumer, eDiscovery, and enterprise big data customers.  Gigabits, a related program, is the first operational enterprise class clustering program which I put into service in 2004.  Yippy’s Velocity platform was essentially based off of my original work which will allow Yippy to sell behind the firewall installations for all types of search based applications for enterprise and eDiscovery customers.”

Yippy’s Chief Executive Rich Granville claims that the acquisition will benefit customers not only through technological innovation but also through lower costs. He directed interested parties to a demo that might illustrate the massive potential in the merger of these companies. The demo shows that the combined indexing of billions of pages of data has already begun, though it does not say when the work will be complete. What is less clear is who is indexing what in this tie-up.

Chelsea Kerwin, July 19, 2013

Sponsored by ArnoldIT.com, developer of Augmentext
