Fake Content: SEO or Tenure Desperation

November 23, 2014

This morning I thought briefly about “Profanity Laced Academic Paper Exposes Scam Journal.” The Slashdot item comments on a journal write-up filled with nonsense. The paper was accepted by the International Journal of Advanced Computer Technology. I have received requests for papers from similar outfits. I am not interested in getting on a tenure track. The notion of my paying someone to publish my writings does not resonate. I either sell my work or give it away in this blog or one of the others available to me.

The question in my mind ping-ponged between two different ways to approach this “pay to say” situation.

First, consider the authors involved in academic pursuits: “Are these folks trying to get the prestige that comes from publishing in an academic journal?” My hunch is that the motivation is similar to the force that drives the fake data people.

Second, has the search engine optimization crowd convinced otherwise semi-coherent individuals that a link—any link—is worth money?

Indexing systems have a spotty record of identifying weaponized, shaped, or distorted information. The fallback position for many vendors is that by processing large volumes of information, the outliers can be easily tagged and either ignored or disproved.

Sounds good. Does it work? Nope. The idea that open source content is “accurate” may be a false assumption. You can run queries on Bing, iSeek, Google, and Yandex for yourself. Check out information related to the Ebola epidemic or modern fighter aircraft. What’s correct? What’s hoo hah? What’s downright craziness? What’s filtered? Figuring out what to accept as close to the truth is expensive and time-consuming. Not part of today’s business model in most organizations, I fear.

Stephen E Arnold, November 23, 2014

More Metadata: Not Needed Metadata

November 21, 2014

I find the metadata hoo hah fascinating. Indexing has been around a long time. If you want to dig into the complexities of metadata, you may find the table from InfoLibCorp.com helpful:

[Image: metadata complexity table from InfoLibCorp.com]

Mid tier consulting firms often do not use the products or systems their “experts” recommend. Consultants in indexing do create elaborate diagrams that make my eyes glaze over.

Some organizations generate metadata without considering what is required. As a result, outputs from the systems can present mind-bogglingly complex options to the user. A report displaying multiple layers of metadata can be difficult to understand.

My thought is that before giving the green light to promiscuous metadata generation, some analysis and planning may be useful. The time lost trying to figure out which metadata is relevant to a particular issue can be critical.

But consultants and vendors are indeed impressed with flashy graphics. Too many times no one has a clue what the graphics are trying to communicate. The worst offenders are companies that sell visual sizzle to senior managers. The goal is a gasp from the audience when the Hollywood style visualizations are presented. Pass the popcorn. Skip the understanding.

Stephen E Arnold, November 21, 2014

Deciding Between SharePoint Online or On Premises Versions

November 17, 2014

Though the relevancy of on-premises installations of SharePoint is dwindling, it might still be the right choice for some organizations. SearchContentManagement.com shares key differences between the two versions in “SharePoint Online Vs. On-Premises Is Already an Outmoded Question” (registration required). The write-up cautions that Microsoft is bound to take SharePoint entirely into the cloud, perhaps as early as 2016, but lays out the facts so readers can judge whether a local installation would best suit them in the meantime.

On the subject of search functionality, the write-up reports:

“Both SharePoint on-premises and Online have search capabilities. The big difference is what their search indexes can include. Typically, when the phrase enterprise search is used, it means that the search engine in question can index multiple, disparate content sources.

“In the case of SharePoint on-premises, this is true. SharePoint has long been capable of indexing SharePoint content, as well as content stored on file shares, Exchange, websites and Lotus Notes databases, among various content sources. Starting in 2007, Microsoft added the capability of indexing structured data from databases and other applications through the then-called Business Data Catalog. That feature has since matured and is now called Business Connectivity Services (BCS), and it allows virtually the same capabilities.

“The same isn’t true of SharePoint Online. The search engine can index all content stored in SharePoint and sources connected through BCS, but not index file shares, other websites or Lotus Notes databases. While the capability is largely constrained based on where SharePoint Online is hosted, the more fundamental difference is the controls available to administrators; the ability to define other content sources, like on-premises implementations, simply doesn’t exist.”

That’s disappointing. The article also contrasts the products in the areas of business data, custom development, and the relationship to its cloud service Azure. It goes on to describe a pattern of Microsoft “deconstructing” its on-premises products into individual services available through Azure, a trend that effectively turns search functionality into a stand-alone product that can be integrated into other applications. Eventually, the piece suggests, Microsoft may completely deconstruct SharePoint into a selection of Azure services. Perhaps. But will companies ever get their access to additional content sources back?

Cynthia Murrell, November 17, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

LinkedIn Enterprise Search: Generalizations Abound

November 11, 2014

Three or four days ago I received a LinkedIn message that a new thread had been started on the Enterprise Search Engine Professionals group. You will need to be a member of LinkedIn and do some good old fashioned brute force search to locate the thread with this headline, “Enterprise Search with Chinese, Spanish, and English Content.”

The question concerned a LinkedIn user information vacuum job. A member of the search group wanted recommendations for a search system that would deliver “great results with content outside of English.” Most of the intelligence agencies have had this question in play for many years.

The job hunters, consultants, and search experts who populate the forum do not step forth with intelligence agency type responses. In a decision-making environment where inputs in a range of languages are the norm for risk-averse organizations, the suggestions offered to the LinkedIn member struck me as wide of the mark. I wouldn’t characterize the answers as incorrect. Uninformed or misinformed are candidate adjectives, however.

One suggestion offered to the questioner was a request to define “great.” Like love and trust, great is fuzzy and subjective. The definition of “great,” according to the expert asking the question, boils down to “precision, mainly that the first few results strike the user as correct.” Okay, the user must perceive results as “correct.” But as ambiguous as this answer remains, the operative term is precision.

In search, precision is not fuzzy. Precision has a definition that many students of information retrieval commit to memory and then include in various tests, papers, and public presentations. For a workable definition, see Wikipedia’s take on the concept or L. Egghe’s “The Measures Precision, Recall, Fallout, and Miss As a Function of the Number of Retrieved Documents and Their Mutual Interrelations,” Universiteit Antwerp, 2000.

In simple terms, the system matches the user’s query. The results are those that the system determines to contain identical or statistically close matches to the user’s query. Old school brute force engines relied on string matching. Think RECON. More modern search systems toss in term matching after truncation, nearness of the terms used in the user query to the occurrence of terms in the documents, and dozens of other methods to determine likely relevant matches between the user’s query and the document set’s index.

With a known corpus like ABI/INFORM in the early 1980s, a trained searcher testing search systems could craft queries against that known collection. Then, as the test queries were fed to the search system, the results could be inspected and analyzed. Running test queries was an important part of our analysis of a candidate search system; for example, the long-gone DIALCOM system or a new incarnation of the European Space Agency’s system. Rigorous testing and analysis make it easy to spot dropped updates or screw ups that routinely find their way into bulk file loads.
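For what it is worth, the arithmetic behind precision and recall is short. Here is a minimal sketch of scoring one test query against a known collection in the ABI/INFORM style described above; the document identifiers and relevance judgments are hypothetical.

```python
# Sketch: score one test query against a known collection.
# Document IDs and relevance judgments are hypothetical.

def precision_recall(retrieved, relevant):
    """Precision = relevant hits / all hits; recall = relevant hits / all relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Known relevant documents for the test query, built by a trained searcher.
relevant_docs = {"ABI-1042", "ABI-2210", "ABI-3377", "ABI-4501"}

# What the candidate search system actually returned.
system_results = ["ABI-1042", "ABI-9981", "ABI-2210", "ABI-7734"]

p, r = precision_recall(system_results, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.50
```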

Our rule of thumb was that if an ABI/INFORM index contained a term, a high precision result set on SDC ORBIT would include that term in the respective hit. If the result set did not contain a match, it was pretty easy to pinpoint where the indexing process started dropping files.

However, when one does not know what’s been indexed, precision drifts into murkier areas. After all, how can one know if a result is on point if one does not know what’s been indexed? One can assume that a result set is relevant via inspection and analysis, but who has time for that today? That’s the danger in defining precision as whatever the user perceives. The user may not know what he or she is looking for. The user may not know the subject area or the entities associated consistently with the subject area. Should anyone be surprised when the user of a system has no clue what a system output “means,” whether the results are accurate, or whether the content is germane to the user’s understanding of the information needed?

Against this somewhat drab backdrop, the suggestions offered to the LinkedIn person looking for a search engine that delivers precision over non-English content (more accurately, content that is not in the primary language of the person doing the search) are revelatory.

Here are some responses I noted:

  • Hire an integrator (Artirix, in this case) and let that person use the open source Lucene based Elasticsearch system to deliver search and retrieval. Sounds simplistic. Yep, it is a simple answer that ignores source language translation, connectors, index updates, and methods for handling the pesky issues related to how language is used. Figuring out what a source document says in a language in which the user is not fluent is fraught with challenges. Forget dictionaries. Think about the content processing pipeline. Search is almost the caboose at the end of a very long train.
  • Use technology from LinguaSys. This is a semantic system that is probably not well known outside a narrow circle of customers, though it has some visibility within the defense sector. Keep in mind that it performs some of the content processing functions. The technology has to be integrated into a suitable information retrieval system. LinguaSys is the equivalent of adding a component to a more comprehensive system. Another person mentioned BASIS Technologies, another company providing multi-language components.
  • Rely on LucidWorks. This is an open source search system based on Solr. The company has spun the management revolving door a number of times.
  • License Dassault’s Exalead system. The idea is worth considering, but how many organizations are familiar with Exalead or willing to embrace the cultural approach of France’s premier engineering firm? After years of effort, Exalead is not widely known in some pretty savvy markets. But the Exalead technology is not 100 percent Exalead. Third party software delivers the goods, so Exalead is an integrator in my view.
  • Embrace the Fast Search & Transfer technology, now incorporated into Microsoft SharePoint. Unmentioned is the fact that Fast Search relied on a herd of human linguists in Germany and elsewhere to keep its 1990s multilingual system alive and well. Fast Search, like many other allegedly multilingual systems, relies on rules, and these have to be written, tweaked, and maintained.

So what did the LinkedIn member learn? The advice offers one popular approach: Hire an integrator and let that company deliver a “solution.” One can always fire an integrator, sue the integrator, or go to work for the integrator when the CFO tries to cap the cost of a system that must please a user who may not know the meaning of nus in Japanese from a now almost forgotten unit of Halliburton.

The other approach is to go open source. Okay. Do it. But as my analysis of the Danish Library’s open source search initiative in Online suggested, the work is essentially never done. Only a tolerant government and lax budget oversight make this avenue feasible for many organizations with a search “problem.”

The most startling recommendation was to use Fast Search technology. My goodness. Are there not other multi lingual capable search systems dating from the 1990s available? Autonomy, anyone?

Net net: The LinkedIn enterprise search threads often underscore one simple fact:

Enterprise search is assumed to be one system, an app if you will.

One reason for the frequent disappointment with enterprise search is this desire to buy an iPad app, not engineer a constellation of systems that solve quite specific problems.

Stephen E Arnold, November 11, 2014

Insights from Search Pro Dave Hawking

November 7, 2014

Search-technology expert Dave Hawking is now working with Microsoft to improve Bing. Our own Stephen Arnold spoke to Mr. Hawking when he was still helping propel Funnelback to great heights. Now, IDM Magazine interviews the search wizard about his new gig, some search history, and challenges currently facing enterprise search in, “To Bing and Beyond.”

Anyone interested in the future of Bing, Microsoft, or enterprise search, or in Australian computer-science history, should check out the article. I was interested in this bit Hawking had to say about ways that tangled repository access can affect enterprise search:

“Access controls for particular repositories are often out of date, inappropriate, and inconsistent, and deployment of enterprise search exposes these problems. They can arise from organisational restructuring, staff changes or knee-jerk responses to unauthorised accesses. As there are usually a large number of repositories, rationalising access controls to ensure that search results respect policies is a lot of work.

“Organisations vary widely in their approach to security: some want security enforced with early binding (recording permissions at indexing time), others want late binding, where current permissions are applied when query results are displayed, or a hybrid of the two.

“This choice has a major impact on performance. Another option is ‘translucency’, where users may see the title of a document but not its content, or receive an indication that documents matching the query exist but that they need to request permission to access them. As well as these security model variations, organisations vary in their requirements for customization, integration and presentation, and how results from multiple repositories should be prioritized, tending to make enterprise search projects quite complex.”
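The early binding versus late binding distinction Hawking describes is easier to see in code. Below is a minimal sketch under invented names and data structures; it is not a rendering of any particular product’s security model.

```python
# Sketch: early binding freezes permissions into the index at indexing time;
# late binding checks the live permission source when results are displayed.
# All names and data structures are hypothetical.

index = []

def index_document(doc_id, text, allowed_groups):
    # Early binding: the ACL is recorded with the index entry.
    index.append({"id": doc_id, "text": text, "acl": set(allowed_groups)})

def search_early(query, user_groups):
    # Fast, but the stored ACL can go stale after a reorganisation.
    return [d["id"] for d in index
            if query in d["text"] and d["acl"] & user_groups]

def search_late(query, user_groups, live_acl_lookup):
    # Slower: the repository is asked for current permissions at query time.
    return [d["id"] for d in index
            if query in d["text"] and live_acl_lookup(d["id"]) & user_groups]

index_document("doc-1", "quarterly results", {"finance"})
print(search_early("results", {"finance"}))                     # ['doc-1']
print(search_late("results", {"hr"}, lambda _id: {"finance"}))  # []
```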

Eventually, standards and best practices may spread that will reduce these complexities. Then again, perhaps technology now changes too fast for such guidelines to take root. For now, at least, experts who can skillfully navigate this obstacle-strewn field will continue to command a pretty penny.

Cynthia Murrell, November 07, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Enterprise Search: Is It Really a Loser?

November 5, 2014

I read “Enterprise Search: Despite Benefits, Few Organizations Use Enterprise Search.” The headline caught my attention. In my experience, most organizations have information access systems. Let me give you several recent examples:

  • US government agency. This agency licenses technology from a start up called Red Owl Analytics. That system automatically gathers and makes available information pertinent to the licensing agency. One of the options available to the licensee is to process information that is available within the agency. The system generates outputs and there are functions that allow a user to look for information. I am reasonably confident that the phrase “enterprise search” would not be applied to this company’s information access system. Because Red Owl fits into a process for solving a business problem, the notion of “enterprise search” would be inappropriate.
  • Small accounting firm. This company uses Microsoft Windows 7. The six person staff uses a “workgroup” method that is easy to set up and maintain. The Windows 7 user can browse the drives to which access has been granted by the part time system administrator. When a person needs to locate a document, the built in search function is used. The solution is good enough. I know that when Windows-centric, third party solutions were made known to the owner, the response was, “Windows 7 search is good enough.”
  • Large health care company with dozens of operating units. The company has been working to integrate certain key systems. The largest on-going project is deploying an electronic health care system. Each of the units has legacy search technology. The most popular search systems are those built into applications used every day. Database access is provided by these applications. One unit experimented with a Google Appliance and found that it was useful to the marketing department. Another unit has a RedDot content management system and has an Autonomy stub. The company has no plans, as I understand it, to make federated enterprise search a priority. There is no single reason. Other projects have higher priority and include a search function.

If my experience is representative (and I am not suggesting what I have encountered is what you will encounter), enterprise search is a tough sell. When I read this snippet, I was a bit surprised:

Enterprise search tools are expected to improve and that may improve uptake of the technology.  Steven Nicolaou, Principal Consultant at Microsoft, commented that “enterprise search products will become increasingly and more deeply integrated with existing platforms, allowing more types of content to be searchable and in more meaningful ways. It will also become increasingly commoditized, making it less of a dark art and more of a platform for discovery and analysis.”

What this means is that when a company provides “good enough” search baked into an operating system (think Windows) or an application (think about the search function in an electronic health record), there will be little room for a third party to land a deal in most cases.

The focus in enterprise search has been off the mark for many years. In fact, today’s vendors are recycling the benefits and features hawked 30 years ago. I posted a series of enterprise search vendor profiles at www.xenky.com/vendor-profiles. If you work through that information, you will find that the marketing approaches today are little more than demonstrations of recycling.

The opportunity in information access has shifted. The companies making sales and delivering solid utility to licensees are NOT the companies that beat the drum for customer support, indexing, and federated search.

The future belongs to information access systems that fit into mission critical business processes. Until the enterprise search vendors embrace a more innovative approach to information access, their future looks a bit cloudy.

In cooperation with Telestrategies, we may offer a seminar that talks about new directions in information access. The key is automation, analytics, and outputs that alert, not the old model of fiddling with a query until the index is unlocked and potentially useful information is available.

If you want more information about this invitation only seminar, write me at seaky2000 at yahoo dot com, and I will provide more information.

Stephen E Arnold, November 5, 2014

Altegrity Kroll: Under Financial Pressure

October 30, 2014

Most of the name surfing search experts—like the fellow who sold my content on Amazon without my permission and used my name to boot—will not recall much about Engenium. That’s no big surprise. Altegrity Kroll owns the pioneering company in the value-added indexing business. Altegrity, as you may know, is the owner of the outfit that cleared Edward Snowden for US government work.

I read “Snowden Vetter Altegrity’s Loans Plunge: Distressed Debt”. In that article I learned:

Altegrity Inc., the security firm that vetted former intelligence contractor Edward Snowden, has about six months until it runs out of money as the loss of background-check contracts negate most of a July deal with lenders to extend maturities for five years.

The article reports that “selective default” looms for the company. With the lights flickering at a number of search and content processing firms, I hope that the Engenium technology survives. The system remains a leader in a segment which has a number of parvenus.

Stephen E Arnold, October 30, 2014

Google Scholar and Google Silos of Content

October 18, 2014

I read “Making the World’s Problem Solvers 10% More Efficient.” The article explains that the Google engineer who was “the key inventor” of Google Scholar is leaving the GOOG.

The write up discloses a couple of interesting factoids; for example:

  • Google Scholar has been around for 10 years
  • The founder of Google Scholar took charge of Google’s indexing in year 2000
  • The inventor of Google Scholar had to figure out how to keep Google’s index fresh; that is, new and changed content are reflected in search results.

The most interesting point in the write up is this statement (I have added the boldface):

Also, the nature of academic papers presented some opportunities for more powerful ranking, particularly making use of the citations typically included in academic papers. Those same scholarly citations had been the original inspiration for PageRank, the technique that had originally made Google search more powerful than its competitors. Scholar was able to use them to effectively rank articles on a given query, as well as to identify relationships between papers.

What happened to Eugene Garfield? I know, “Who?” So does this passage mean that today’s Google Web search discards functionality originally included in year 2000?
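The citation-based ranking idea in the passage reduces to a small computation. Here is a toy sketch using the textbook power-iteration form of PageRank over an invented citation graph; it illustrates the general technique, not whatever Scholar actually runs.

```python
# Toy PageRank over a citation graph: paper -> papers it cites.
# The papers and links are invented; 0.85 is the textbook damping factor.
citations = {
    "paper_a": ["paper_b", "paper_c"],
    "paper_b": ["paper_c"],
    "paper_c": [],
}

def pagerank(graph, damping=0.85, iterations=50):
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in graph}
        for node, cited in graph.items():
            if cited:
                share = damping * rank[node] / len(cited)
                for target in cited:
                    new_rank[target] += share
            else:
                # A paper that cites nothing spreads its weight evenly.
                for target in graph:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

print(pagerank(citations))  # paper_c, the most cited, ends up ranked highest
```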

But the big point for me is that Google is supposed to deliver “universal search.” To make use of Google Scholar, one must navigate to http://scholar.google.com and run separate queries. Is this universal? It seems to be old school siloing.

I like Google Scholar, but I think Google Web search may lack some of the refinements included in Google Scholar. Well, ads are important. Correction: Revenue is important. Perhaps Google will charge for access to Google Scholar and compete directly with commercial database vendors? In my view, Google Scholar had a negative impact on commercial database vendors who charge libraries, corporations, and individuals for access to curated and indexed professional and scholarly information. Google seems content to allow the Google Scholar service to drift along. Would more purpose be of value? Would queries for patent 2012/0251502 A1’s “the isolated nucleic acid molecule includes the nucleotide sequence of SEQ ID NOs: 1 or 10, or a complement thereof. In another, the nucleic acid molecule includes a nucleotide sequence having at least 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 4600, 4700, 4800, or 4900 contiguous nucleotides of the nucleotide sequence of SEQ ID NO: 1” permit Google to match Ebola ads to Google Scholar content?

Stephen E Arnold, October 18, 2014

New IBM Redbook: IBM Watson Enterprise Search and Analytics

October 12, 2014

The Redbook is free. You can download it from this IBM link for now. The full title is “IBM Watson Content Analytics. Discovering Actionable Insight from Your Content.”

The Redbook weighs in with 598 pages of Watson goodness. If you follow the IBM content analytics products, you may know that the previous version was known as IBM Content Analytics with Enterprise Search (ICAwES).

The Redbook presents some philosophical content. IBM has a tradition to uphold. In addition, the Redbook provides information about facets (yep, good old metadata), some mathy features that make analytics analytical, and sentiment analysis.

ICAwES does not operate as an island. The sprawling system can hook into IBM’s semi automatic classification system, Cognos, and interface tools.

Is ICAwES an “enterprise search” system? I would say, “Sure is.” You will have to work through the Redbook and draw your own conclusions. You will also want to identify the Watson component. Watson is Lucene with IBM scripts and wrappers, but IBM has far more colorful lingo for describing the system. After all, IBM Watson is supposed to generate $1 billion in a snappy manner. If IBM’s plan bears revenue fruit, in five or six years, Watson will be a $10 billion per year business. That’s quite a goal, considering Autonomy required 13 years to push into $800 million in revenue territory and IBM has been offering information retrieval systems since the days of STAIRS.

The new information in the July 2014 edition of the Redbook adds a chapter containing some carefully selected case studies. There is a new chapter called “Enterprise Search” to which I will return in a moment. Also, the many authors of the Redbook have added to the discussion of Cognos, one of IBM’s business intelligence systems. Finally, the Redbook provides some helpful suggestions for “customizing and extending the content analytics miner.”

I urge you to work through this volume because it provides a useful yardstick against which to measure the IBM Watson marketing and public relations explanations against the reality, limitations, and complexity of the IBM Content Analytics system. Is the Redbook describing a product or a collection of components that an IBM implementation team will use to craft a customized solution?

The chapter on Enterprise Search begins on page 445 and continues to page 486. The solution is a two part affair. On one hand, processed content will output data about the entities, word frequencies, and similar metrics in the corpus and updates to the corpus. On the other hand, ICAwES is a search and retrieval system. Many vendors take this approach today; however, certain types of content cannot be comprehensively processed by the system. Examples include video content, engineering drawings, digital imagery, and certain types of ephemeral content such as text messages sent via an ad hoc Bluetooth mesh network. One can code up a fix, but that is likely to be more hassle than many licensees will tolerate.

The Redbook shows some ready-to-use interfaces. These can, of course, be modified. The sample in the screenshot below looks quite a bit like the original Fulcrum Technologies’ presentation of information processed by the system. A more modern implementation would be Amazon’s recent JSON centric system for content.

[Image: sample ICAwES interface screenshot]

ICAwES Redbook, Copyright IBM 2014.

The illustration shows a record viewed by tags; for example, categories. Items can be tallied in a chart that provides a summary of how many content objects share a particular index term. The illustration shows ICAwES identifying terms in a user’s query, identifying entities like IBM Lotus Domino, and other features associated with Autonomy IDOL or Endeca style systems. Both of these date from the late 1990s, so IBM is not pushing too far from the dirt path carved out of the findability woods by former leaders in enterprise search.

IBM provides information needed to implement query expansion. Yes, a dictionary lurks within the system, and an interface is provided so the licensee can be like Noah Webster. The system is rules based, and a specialist is needed to create or edit rules. As you may know, rules based systems suffer from several drawbacks. Rules have to be maintained, subject matter experts or programmers are usually required to make the proper judgments, and rules can drift out of phase with the users’ queries unless the system is monitored with above average rigor. Like Autonomy IDOL, skimp on monitoring and tuning, and the system can generate some interesting results.
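To make the maintenance point concrete, here is a minimal sketch of dictionary-driven query expansion of the general sort the Redbook describes; the synonym entries and the rule format are invented for illustration, not taken from IBM’s documentation.

```python
# Sketch: rules-based query expansion. Every entry is a rule that a subject
# matter expert has to write, test, and keep current as vocabulary shifts.
synonym_rules = {
    "notes": ["lotus notes", "domino"],
    "ehr": ["electronic health record", "electronic medical record"],
}

def expand_query(terms):
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(synonym_rules.get(term.lower(), []))
    return expanded

print(expand_query(["EHR", "vendor"]))
# ['EHR', 'electronic health record', 'electronic medical record', 'vendor']
# If users start typing "emr" and no one adds a rule, recall quietly degrades.
```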

The provided user interface looks like this:

[Image: ICAwES user interface screenshot]

ICAwES Redbook, Copyright IBM 2014.

With many users wanting a “big red button” to simplify information access, this interface brings forward the high density displays associated with TeraText and similar legacy systems. The density seems to include hints of Attivio and BA Insight user interfaces as well. There are many choices available to the user. However, without special training, it is unlikely that a marketing professional using ICAwES will be able to make full use of query trees, category trees, and the numerous icons that appear in four different locations. I can hear the user now, “I want this system to be just like Google. I want to type in three words and scan the results.”

Net net. If you are working in an organization that favors IBM solutions, this system is likely to be what senior management licenses. Keep in mind that ICAwES will require the ministrations of IBM professional services, probably additional headcount, and on-going work to keep the system delivering useful results to users and decision makers.

The system delivers key word search, rich indexing, and basic metrics about the content. IBM offers more robust analytic tools in its SPSS product line. For more comprehensive text analysis, take a look at IBM i2 and Cybertap solutions if your organization has appropriate credentials for these somewhat more sophisticated information access and analysis systems.

After working through the Redbook, I had one question, “Where’s Watson?”

Stephen E Arnold, October 12, 2014

SRCH2: Security and Speed

October 12, 2014

Oracle’s Secure Enterprise Search offered advanced security. Perfect Search stressed its speed. SES has been marginalized. That particular security pitch did not work. Perfect Search also has faded from the scene.

Perhaps pitching both security and speed will yield more together than as separate features.

SRCH2 asserts that it is four times faster than open source search engines. None of the open source search engines is a speed demon. Speed boosts require additional work on the specific subsystem introducing the latency for a particular deployment.

SRCH2’s “Real Time Computer Requires Faster Search” makes a case for the optimization built into SRCH2’s system. The article states:

SRCH2 offers the world’s fastest search engine. Why is speed so important? After all, the human eye can’t detect the difference between a 10-millisecond and 50-millisecond response time.

Some data backing this assertion would be helpful. In a direct comparison of Lucid Works’ technology with Elasticsearch’s technology, the ArnoldIT team found that one was faster in indexing and the other was faster in query processing. Both could be improved with focused optimization. Perhaps SRCH2 will share some of the data which backs up the “four times faster” claim? (I am not at liberty to release the performance data a client requested my team compile from live tests on my test corpus.)

SRCH2’s “SRCH2 Introduces Access Control Lists to Improve Search Security” takes up the security angle. The article states:

SRCH2 took the approach of providing native support of access control to set restrictions on search results. With SRCH2’s ACL feature, developers can restrict user permissions to access either certain records in an index, or specific attributes within a record or set of records.

The approach is useful. However, it is less robust than the Oracle approach, which implemented a wider range of features provided by specialized Oracle subsystems.
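For readers unfamiliar with the distinction, here is a rough sketch of what record-level and attribute-level restrictions amount to at query time; the record layout, role names, and field names are invented, and this is not SRCH2’s API.

```python
# Sketch: record-level and attribute-level ACL filtering of search results.
# Roles, records, and field names are hypothetical.
records = [
    {"id": 1, "title": "Q3 forecast", "body": "numbers...",
     "acl": {"record": {"finance"}, "fields": {"body": {"finance"}}}},
    {"id": 2, "title": "Holiday schedule", "body": "dates...",
     "acl": {"record": {"finance", "hr"}, "fields": {}}},
]

def filter_for_user(results, user_roles):
    visible = []
    for rec in results:
        if not rec["acl"]["record"] & user_roles:
            continue  # record-level restriction: drop the whole record
        trimmed = {k: v for k, v in rec.items() if k != "acl"}
        for field, allowed in rec["acl"]["fields"].items():
            if not allowed & user_roles:
                trimmed.pop(field, None)  # attribute-level restriction
        visible.append(trimmed)
    return visible

print(filter_for_user(records, {"hr"}))  # only record 2 survives, all fields intact
```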

Will the combination of security and speed pay off for SRCH2? Good question. I do not have an answer.

Stephen E Arnold, October 11, 2014
