Concept Searching SharePoint White Paper

October 22, 2015

I saw a reference to “2015 SharePoint and Office 365 State of the Market Survey White Paper.” If you are interested in things SharePoint and Office 365, you can (as of October 15, 2015) download the 40-page document at this Concept Searching link. A companion webinar is also available.

The most interesting portion of the white paper is its Appendix A. A number of buzzwords are presented as “Priorities by Application.” Note that the Appendix is graphical and presents the result of a “survey.” Goodness, SharePoint seems to have some holes in its digital fabric.

The data for enterprise search are interesting.

image

Source: Concept Searching, 2015

It appears that fewer than 20 percent of those included in the sample (there are few details about the mechanics of this survey, whose data were gathered via the Web) do not see enterprise search as a high priority issue. About 30 percent of the respondents perceive search as working as intended. An equal number, however, are beavering away to improve their enterprise search system.

Unlike some enterprise search and content processing vendors, Concept Searching is squarely in the Microsoft camp. With third party vendors providing “solutions” for SharePoint and Office 365, I ask myself, “Why doesn’t Microsoft address the shortcomings third parties attack?”

Stephen E Arnold, October 22, 2015

Attensity: Discover Now

October 21, 2015

I read “Speedier Data Analysis Focus of Attensity’s DiscoverNow.” Attensity is one of the firms processing content for information signals. The company has undergone some management turnover and has rolled out DiscoverNow, a product that runs from the cloud and features “built in integration with the Informatica cloud.” The write up reports:

According to the company, DiscoverNow connects to more than 150 internal and external text-based data sources, including popular enterprise apps and databases such as Salesforce.com, SAP, Oracle/Siebel, Box, Concur, Dropbox, Datasift, Eloqua, JIRA, MailChimp, Marketo, NetSuite, Hadoop, MySQL and Thomson Reuters. It combines insights from these internal data sources with external text sources such as Twitter, Facebook, Google+, YouTube, Reddit, forums and review sites, to offer a robust view of customer activities.

Attensity is, according to the article, different and outperforms its competitors. According to Cary Fulbright, Attensity’s chief strategy officer:

Attensity outperforms competing text analytics systems that rely more heavily on keywords. “We parse sentences by subject, noun and object, so we can identify the context used,” he said. “For example, DiscoverNow understands the difference between the Venetian Hotel, Venetian blinds and Venetian gondolas, or ‘uber cool’ and Uber ridesharing. Our team of linguists is constantly updating our generic and industry-specific libraries with new terms, including slang.”
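Attensity does not publish its parser, but the general idea in the quote, using nearby words to pick the right sense of an ambiguous term, can be sketched with a toy context-window classifier. Everything below (the cue lists, the function name, the window size) is illustrative, not Attensity's actual technology:

```python
# Hypothetical sketch: disambiguating "Venetian" by the words around it.
# The cue lists are invented for illustration, not Attensity's libraries.
CUES = {
    "hotel": {"hotel", "casino", "suite", "vegas"},
    "blinds": {"blinds", "window", "shade", "slats"},
    "gondola": {"gondola", "canal", "venice", "boat"},
}

def disambiguate(term, sentence, window=3):
    """Pick the sense of `term` whose cue words appear nearest to it."""
    tokens = [t.strip(".,!?").lower() for t in sentence.split()]
    if term.lower() not in tokens:
        return None
    i = tokens.index(term.lower())  # first occurrence only, for simplicity
    context = set(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    # Score each sense by how many of its cue words fall in the window.
    scores = {sense: len(cues & context) for sense, cues in CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(disambiguate("Venetian", "We stayed at the Venetian hotel and casino"))
```

A real system would use parse structure rather than a flat window, as the quote claims, but the principle of resolving a term against its context is the same.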

A number of companies offer text processing systems. Attensity is a mashup of several organizations. DiscoverNow may be the breakthrough product the company has been seeking. To date, according to Crunchbase, the company has ingested $90 million since 2000.

Stephen E Arnold, October 21, 2015

The Tweet Gross Domestic Product Tool

October 16, 2015

Twitter can be used to figure out your personal income.  Twitter was not designed as a tool to tally a person’s financial wealth; it is a communication tool built around 140-character messages meant for small, concise delivery.  Twitter can be used to chat with friends, stars, business executives, etc., follow news trends, and even advertise products to a tailored audience.  According to Red Orbit in the article “People Can Guess Your Income Based On Your Tweets,” Twitter has another application.

Other research on Twitter has revealed your age, location, political preferences, and disposition to insomnia; now it appears your tweet history also reveals your income.  Apparently, if you tweet less, you make more money.  The controls and variables for the experiment were discussed, including that 5,191 Twitter accounts with over ten million tweets were analyzed and that only accounts with a user’s identifiable profession were used.

Users with a high follower and following ratio had the most income and they tended to post the least.  Posting throughout the day and cursing indicated a user with a lower income.  The content of tweets also displayed a plethora of “wealth” information:

“It isn’t just the topics of your tweets that’s giving you away either. Researchers found that “users with higher income post less emotional (positive and negative) but more neutral content, exhibiting more anger and fear, but less surprise, sadness and disgust.” It was also apparent that those who swore more frequently in their tweets had lower income.”

Twitter uses the information to tailor ads for users: those who share neutral posts get ads for expensive items, while the cursers get less expensive ad campaigns.  The study also suggests that it is important to monitor your Twitter profile, so you are posting the best side of yourself rather than shooting yourself in the foot.

Whitney Grace, October 16, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

Can Online Systems Discern Truth and Beauty or All That One Needs to Know?

October 14, 2015

Last week I fielded a question about online systems’ ability to discern loaded or untruthful statements in a plain text document. I responded that software is not yet very good at figuring out whether a specific statement is accurate, factual, right, or correct. Google pokes at the problem in a number of ways; for example, assigning a credibility score to a known person. The higher the score, the more likely the person is to be “correct.” I am simplifying, but you get the idea: recycling a variant of PageRank and the CLEVER method associated with Jon Kleinberg.

There are other approaches as well, and some of them—dare I suggest, most of them—use word lists. The idea is pretty simple. Create a list of words which have positive or negative connotations. To get fancy, you can work a variation on the brute force Ask Jeeves’ method; that is, cook up answers or statement of facts “known” to be spot on. The idea is to match the input text with the information in these word lists. If you want to get fancy, call these lists and compilations “knowledgebases.” I prefer lists. Humans have to help create the lists. Humans have to maintain the lists. Get the lists wrong, and the scoring system will be off base.

There is quite a bit of academic chatter about ways to make software smart. A recent example is “Sentiment Diffusion of Public Opinions about Hot Events: Based on Complex Network.” In the conclusion to the paper, which includes lots of fancy math, I noticed that the researchers identified the foundation of their approach:

This paper studied the sentiment diffusion of online public opinions about hot events. We adopted the dictionary-based sentiment analysis approach to obtain the sentiment orientation of posts. Based on HowNet and semantic similarity, we calculated each post’s sentiment value and classified those posts into five types of sentiment orientations.

There you go. Word lists.

My point is that it is pretty easy to spot a hostile customer support letter. Just write a script that looks for words appearing on the “nasty list”; for example, consumer protection violation, fraud, sue, etc. There are other signals as well; for example, capital letters, exclamation points, underlined words, etc.
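Such a script really is a few lines of code. This is a minimal sketch; the phrase list and the extra signals (shouting, exclamation points) are illustrative assumptions, and the matching is naive substring counting, not a production classifier:

```python
# A minimal "nasty list" flagger for customer support text.
# The phrases and signal weights are illustrative, not a real product's list.
NASTY_PHRASES = ["consumer protection violation", "fraud", "sue", "chargeback"]

def hostility_score(text):
    lowered = text.lower()
    # Naive substring counts against the word list.
    score = sum(lowered.count(p) for p in NASTY_PHRASES)
    # Other signals the post mentions: capital letters and exclamation points.
    words = text.split()
    shouting = sum(1 for w in words if len(w) > 2 and w.isupper())
    score += shouting + text.count("!")
    return score

letter = "I will SUE you for FRAUD! This is a consumer protection violation!"
print(hostility_score(letter))
```

A score above some threshold routes the letter to a human. The weakness is exactly the one described above: the list is only as good as the humans who maintain it.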

The point is that distorted, shaped, weaponized, and just plain bonkers information can be generated. This information can be gussied up in a news release, posted on a Facebook page, or sent out via Twitter before the outfit reinvents itself.

The researcher, the “real” journalist, or the hapless seventh grader writing a report will be none the wiser unless big time research is embraced. For now, what can be indexed is presented as if the information were spot on.

How do you feel about that? That’s a sentiment question, gentle reader.

Stephen E Arnold, October 14, 2015

Artificial Intelligence: A Jargon Mandala to Understand the Universe of Search

October 12, 2015

I read “Lux: Useful Sankey Diagram on AI.” According to the Sankey Diagrams site, a “Sankey diagram says more than 1,000 pie charts.” The assumption is, of course, that a pie chart presents meaningful data. In the energy sector you can visualize flows in complex systems. It helps to have numbers when one is working toward a Sankey map, but if real data are not close at hand, one can fudge up some data.

Here’s the Sankey diagram in the write up:

image

You can see an almost legible version at this link.

What the diagram suggests is that certain information access and content processing functions flow into data mining, machine learning, and statistics. If you are a fan of multidimensionality, the arrow of time may flow in the reverse direction; that is from data mining, machine learning, and statistics to affective computing, cognitive computing, computational discovery, image and video analytics, language translation, navigation, recommender systems, and speech recognition.

The intermediary state, tinted a US currency green, provides intermediating operations or conditions; for example, anomaly detection, collaborative filtering, computer eavesdropping, computer vision, pattern recognition, NLP, path planning, clustering, deep learning, dimensionality reduction, network graphical models, online reinforcement learning, pattern similarity, probabilistic modeling, regression, and, my favorite, search algorithms.

The diagram, like the wild and crazy chemical imagery for Watson, seems to be a way to:

  1. Collect a number of discrete operations
  2. Arrange the operations into some orderly framework
  3. Allow the viewer to perceive relationships or the potential for relationships among the operations.

In short, skip the wild and crazy presentations by search and content processing vendors about how search enables broader and, hence, more valuable activities. Search is relegated to an entry in the intermediating column of the Sankey diagram.

My thought is that some folks will definitely love the idea that the many different specialties of content processing can be presented in a mandala which invites contemplation and consideration.

The diagram makes clear that when a company wants to know what one can do with the different and often clever operations one can perform with content, the answer may be, “Make a poster and hang it on the wall.”

In terms of applications, the chart makes quite explicit that some clever team will have to put the parts in order. Does this remind you of building a Star Wars character from Lego blocks?

The construct is the value, not the individual enabling blocks.

Stephen E Arnold, October 12, 2015

Another Categorical Affirmative: Nobody Wants to Invest in Search

October 8, 2015

Gentle readers, I read “Autonomy Poisoned the Well for Businesses Seeking VC Cash.” Keep in mind that I am capturing information which appeared in a UK publication. I find this type of essay interesting and entertaining. Will you? Beats me. One thing is certain. This topic will not be fodder for the LinkedIn discussion groups, the marketers hawking search and retrieval at conferences to several dozen fellow travelers, or in consultant reports promoting the almost unknown laborers in the information access vineyards.

Why not?

The problem with search reaches back a few years, but I will add a bit of historical commentary after I highlight what strikes me as the main point of the write up:

Nobody wants to invest in enterprise search, says startup head Patrick White of Synata.

Many enterprise search systems are a bit like the USS United States, once the slickest ocean liner in the world. The ship still looks like a ship, but making it seaworthy would be a project with a hefty price tag. Implementing an enterprise search solution is a similar ocean-going effort.

There you go. “Nobody.” A categorical in the “category” of logic like “All men are mortal.” Remarkable because outfits like Attivio, Coveo, and Digital Reasoning, among others have received hefty injections of venture capital in recent memory.

The write up makes this interesting point:

“I think Autonomy really messed up [the space]”, and when investors hear ‘enterprise search for the cloud’ it “scares the crap out of them”, he added. “Autonomy has poisoned the well for search companies.” However, White added that Autonomy was just the most high profile example of cases that have scared off investors. “It is unfair just to blame Autonomy. Most VCs have at least one enterprise search in their portfolio. So VCs tend to be skittish about it,” he added.

I am not sure I agree. Before there was Autonomy, there was Fulcrum Technologies. The company’s marketing literature is as fresh today as it was in the 1990s. The company was up, down, bought, and merged. The story of Fulcrum, at least up to 2009 or so, is available at this link.

The hot and cold nature of search and content processing may be traced through the adventures of Convera (formerly Excalibur Technologies) and its relationships with Intel and the NBA, Delphes (a Canadian flame out), Entopia (a we can do it all), and, of course, Fast Search & Transfer.

Now Fast Search, like most old school search technology, is very much with us. For a dose of excitement one can have Search Technologies (founded by some Convera wizards) implement Fast Search (now owned by Microsoft).

Where Are the Former Big Six Enterprise Search Vendors: 2004 and 2015

Autonomy, now owned by HP and mired in litigation over allegations of financial fraud

Convera, which, after struggles with the Intel and NBA engagements, sold off portions of the company. Essentially out of business. Alums are consultants.

Endeca, owned by Oracle and sold as an eCommerce and business intelligence service. Oracle gives away its own enterprise search system.

Exalead, owned by Dassault Systèmes and now marketed as a product component system. No visibility in the US.

Fast Search, owned by Microsoft and still available as a utility for SharePoint. The technology dates from the late 1990s. Brand is essentially low profiled at this time.

Verity, purchased by Autonomy, which used the Verity customer list for upsells and the K2 technology as part of the sprawling IDOL suite.

Fast Search reported revenues which after an investigation and court procedure were found to be a bit enthusiastic. The founder of Fast Search was the subject of the Norwegian authorities’ attention. You can check out the news reports about the prohibition on work and the sentence handed down for the issues the authorities concluded warranted a slap on the wrist and a tap on the head.

The story of enterprise search has been one of efforts—sometimes Herculean—to sell information access companies. When a company sells, as Vivisimo did for about one year’s revenues or an estimated $20 million, there is a sense of getting that mythic task accomplished. IBM, like most of the other acquirers of search technology, tries valiantly to convert a utility into something with revenue lift. As I watch the evolution of the lucky exits, my overall impression is that the purchasers realize that search is a utility function. Search can generate consulting and engineering fees, but the customers want more.

That realization leads to the wild and crazy hyper marketing for products like Hewlett Packard’s cloud version of Autonomy’s IDOL and DRE technology or IBM’s embrace of open source search and the wisdom of wrapping that core with functions.

Enterprise search, therefore, is alive and well within applications or solutions that are more directly related to something that speaks to senior managers; namely, making sales and reducing costs.

What’s the cost of making sure the controls for an enterprise search system are working and doing the job the licensee wants done?

The problem is the credit card debt load which Googlers explained quite clearly. Technology outfits, particularly information access players, need more money than most firms can generate. This contributes to the crazy flips from search to police analysis, from looking up an entry in a database to an assertion that customer support is enabled, from hunting for an article in this blog to real time, active business intelligence, or from indexing by proper nouns like White House to natural language understanding of unstructured text.

Investments are flowing to firms which could be easily positioned as old school search and retrieval operations. Consider Lexmark, a former unit of IBM, and an employer of note not far from my pond filled with mine run off in Kentucky. The company, like Hewlett Packard, wants to find a way to replace its traditional business which was not working as planned as a unit of IBM. Lexmark bought Brainware, a company with patents on trigram methods and a good business for processing content related to legal matters. Lexmark is doing its best to make that into a Trump scale back office content processing business. Lexmark then bought a technology dating from the 1980s (ISYS Search Software once officed in Crow’s Nest I believe) and has made search a cornerstone of the Lexmark next generation health care money spinning machine. Oracle has a number of search properties. Most of these are unknown to Oracle DBAs; for example, Artificial Linguistics, TripleHop, InQuira’s shotgun NLP technology, etc. The point is that the “brands” have not had enough magnetism to pull revenues on a stand alone basis.

Success measured in investment dollars is not revenue. Palantir is, in effect, a search and retrieval outfit packaged as a super stealthy smart intelligence system. Recorded Future, funded by Google and In-Q-Tel, is doing a bang up job with specialized content processing. These are, remember, search and retrieval companies.

The money in search appears to be made in these plays:

  • The Fast Search model. Short cuts until an investigator puts a stop to the activities.
  • Creating a company and then selling it to a larger firm with a firm conviction that it can turn search into a big time money machine
  • Buying a search vendor to get its customers and opportunities to sell other enterprise software to those customers
  • Creating a super technology play and going after venture funding until a convenient time arrives to cash out
  • Pursue a dream for intelligent software and survive on research grants.

This list does not exhaust what is possible. There are me-too plays. There are mobile niche plays. There are apps which are thinly disguised selective dissemination of information services.

The point is that Autonomy is a member of the search and retrieval club. The company’s revenues came from two principal sources:

  1. Autonomy bought companies like Verity and the video indexing and management vendor Virage, then sold other products to these firms’ clients and incorporated some of the acquired technology into products and services which allowed Autonomy to enter new markets. Remember Autonomy and enhanced video ads?
  2. Autonomy managed well. If one takes the time to speak with former Autonomy sales professionals, the message is that life was demanding. Sales professionals including partners had to produce revenue or some face time with the delightful Dr. Michael Lynch or other senior Autonomy executives was arranged.

That’s it. Upselling and intense management for revenues. Hewlett Packard was surprised at the simplicity of the Autonomy model and apparently uncomfortable with the management policies and procedures that Autonomy had been using in highly visible activities for more than a decade as a publicly traded company.

Perhaps some sources of funding will disagree with my view of Autonomy. That is definitely okay. I am retired. My house is paid for. I have no charming children in a private school or university.

The focus should be on what the method for generating revenue is. The technology is of secondary importance. When IBM uses “good enough” open source search, there is a message there, gentle reader. Why reinvent the wheel?

The trick is to ask the right questions. If one does not ask the right questions, the person doing the querying is likely to draw incorrect conclusions and make mistakes. Where does the responsibility rest when one makes a bad decision?

The other point of interest should be making sales. Stated in different terms, the key question for a search vendor, regardless of camouflage, is: what problem are you solving? Then ask, “Will people pay money for this solution?”

If the search vendor cannot or will not answer these questions and provide data which can be verified, the questioner runs the risk of taking the USS United States for a cruise after paying to refurbish the ship, make it seaworthy, and hire a crew.

The enterprise search sector is guilty of making a utility function appear to be a solution to business uncertainty. Why? To make sales. Caveat emptor.

Stephen E Arnold, October 8, 2015

IBM Defines Information Access the Madison Avenue Way

October 7, 2015

Yesterday (October 6, 2015) I wrote a little dialogue about the positioning of IBM as the cognitive computing company. I had a lively discussion at lunch after the story appeared about my suggesting that IBM was making a grand stand play influenced by Madison Avenue thinking, not nuts and bolts realities of making sales and generating revenue.

Well, let’s let IBM rejiggle the line items in its financial statements. That should allow the critics of the company to see how Watson (which is the new IBM) accounts for IBM revenues. I am okay with that, but for me, the important numbers are the top line revenue and profit. Hey, call me old fashioned.

In the midst of the Gartner talk about IBM, the CNBC exclusive with IBM’s Big Blue dog (maybe just like the Gartner talk and thus not really “exclusive”?), and the wall paper scale ads in the New York Times and Wall Street Journal, there was something important. I don’t think IBM recognizes what it has done for the drifting, financially challenged, and incredibly fragmented search and content processing market. Even the LinkedIn enterprise search discussion group which bristles when I quote Latin phrases to the members of the group will be revivified.

image

Indexing and grouping are useful functions. When applied with judgment, an earthworm of unrelated words and phrases may communicate more effectively.

To wit, this is IBM’s definition of Watson, which is search based on Lucene, home brew code, and software from IBM acquisitions:

Author extraction—Lots of “extraction” functions
Concept expansion
Concept insights—I am not sure I understand the concept functions
Concept tagging—Another concept function
Dialog—Part of NLP maybe
Entity extraction—Extraction
Face detection with the charming acronym F****d—Were the Mad Ave folks having a bit of fun?
Feed detection—Aha, image related
Image Link extraction—Aha, keeping track of urls
Image tagging—Aha, image indexing. I wonder if this is recognition or using information in the file or a caption
Keyword extraction
Language detection
Language translation
Message resonance—No clue here in Harrod’s Creek
Natural language classifier—NLP again
Personality insights—Maybe figuring out what the personality of the author of a processed file means?
Question and answer (I think this is natural language processing which incorporates many other functions in this list)—More NLP
Relationship extraction—IBM has technology from its purchase of i2 which performs this function. How does this work on disparate streams of unstructured content? I have some thoughts
Review and rank—Does this mean relevance ranking?
Sentiment analysis—Yes, is a document with the word F****d in it positive or negative
Speech to text—Seems similar to text to speech
Taxonomy—Ah, ha. A system to generate a list of controlled terms. No humans needed? Nah, humans can be billable and it is an IBM function
Text extraction—Another extraction function
Text to speech
Tone analyzer—So what is the tone of a document containing the string F****d?
Tradeoff analytics—Hmm. Now Watson is doing a type of analytics presumably performed on text? What are the thresholds in the numerical recipe? Do the outputs make sense to a normal human?
Visual recognition—Baffler
Watson news—Is this news about Watson or news presented in Watson via a feed-type mechanism. Phrase does not even sound cool to me.

Now that’s a heck of a list. Notice that the word “search” does not appear in the list. I did not spot the word “semantics” either. Perhaps I was asleep at the switch.

When I was in freshman biology class in 1962, Dr. Daphne Swartz, a very traditional cut ‘em up and study ‘em scientist, lectured for 90 minutes about classification. I remember learning about Aristotle and his division of organisms into two groups: plants and animals. I know this is rocket science, but bear with me. There was the charmingly named Carolus Linnaeus, a fan of herring I believe, who cooked up the kingdom, genus, species thing. Then there was, much later, the wild and crazy library crowd which spawned Dewey or, as I named him, Mr. Decimal.

Why is this germane?

It seems to me that IBM’s list of Watson functions needs a bit of organization. In fact, some of the items appear to belong to other items; for example: language detection and language translation. More egregious is the broad concept of natural language processing. One could, if one were motivated, argue that entity extraction, text extraction, and keyword extraction might look similar to a non-Watsonian intellect. Dr. Swartz would probably have some constructive criticism to offer.

What’s the purpose of this earthworm list?

Beats me. Makes IBM Watson seem more than Lucene with add ons?

Stephen E Arnold, October 7, 2015

Full Text Search Gets Explained

October 6, 2015

Full text search is one of the primary functions of most search platforms.  If a search platform cannot get full text search right, then it is useless and should be tossed in the recycle bin.  Full text search is such a basic function these days that most people do not know how to explain what it is.  So what is full text search?

The Xojo article “Full Text Search With SQLite” provides a thorough definition:

“What is full text searching? It is a fast way to look for specific words in text columns of a database table. Without full text searching, you would typically search a text column using the LIKE command. For example, you might use this command to find all books that have “cat” in the description…But this select actually finds row that has the letters “cat” in it, even if it is in another word, such as “cater”. Also, using LIKE does not make use of any indexing on the table. The table has to be scanned row by row to see if it contains the value, which can be slow for large tables.”

After the definition, the article turns into an advertising piece for SQLite and how it improves the quality of full text search.  It offers some more basic explanations, which are hard to follow without a coding background.  It is brief but includes some detailed information; it could explain more about what SQLite is and how it improves full text search.
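The LIKE-versus-full-text distinction the quoted passage describes is easy to demonstrate with Python's built-in sqlite3 module. The table and rows below are invented for illustration; FTS5 support ships with most modern SQLite builds:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# An FTS5 virtual table indexes the text columns for word-level search.
cur.execute("CREATE VIRTUAL TABLE books USING fts5(title, description)")
cur.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [
        ("Feline Friends", "A book about a cat and its owner"),
        ("Party Planning", "How to cater a large event"),
    ],
)
# LIKE matches substrings, so '%cat%' also hits "cater".
like_hits = cur.execute(
    "SELECT title FROM books WHERE description LIKE '%cat%'"
).fetchall()
# MATCH does word-level full text search and uses the FTS index.
fts_hits = cur.execute(
    "SELECT title FROM books WHERE books MATCH 'cat'"
).fetchall()
print(like_hits)
print(fts_hits)
```

The LIKE query returns both rows because "cater" contains the letters "cat"; the MATCH query returns only the book that actually mentions a cat, and it does so via the index rather than a row-by-row scan.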

Whitney Grace, October 6, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

The Cricket Cognitive Analysis

September 4, 2015

While Americans scratch their heads at the sport of cricket, it has a huge fanbase, and there are now mounds of data that can be fully analyzed, says First Post in the article, “The Intersection Of Analytics, Social Media, And Cricket In The Cognitive Era Of Computing.”

According to the article, cricket fans absorb every little bit of information about their favorite players and teams.  Technology advances have allowed cricket players to improve their game with better equipment and new ways to analyze their playing; in turn, fans develop a deeper personal connection with the game as this information is released.  For the upcoming Cricket World Cup, Wisden India will provide all the data points for the game and feed them into IBM’s Analytics Engine to improve the game for spectators and players alike.

Social media is a huge part of the cricket experience, and the article details examples of how platforms like Twitter are processed through sentiment analysis and IBM Text Analytics.

“What is most interesting to businesses however is that observing these campaigns help in understanding the consumer sentiment to drive sales initiatives. With right business insights in the nick of time, in line with social trends, several brands have come up with lucrative offers one can’t refuse. In earlier days, this kind of marketing required pumping in of a lot of money and waiting for several weeks before one could analyze and approve the commercial success of a business idea. With tools like IBM Analytics at hand, one can not only grab the data needed, assess it so it makes a business sense, but also anticipate the market response.”

While cricket might be what the article concentrates on, imagine how data analytics are being applied to other popular sports such as American football, soccer, baseball, golf, and the variety of racing popular around the world.

Whitney Grace, September 4, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Suggestions for Developers to Improve Functionality for Search

September 2, 2015

The article on SiteCrafting titled “Maxxcat Pro Tips” lays out some guidelines for improved functionality when it comes to deep search. The first suggestion is to limit your crawls. Since all links are not created equal, it is wise to avoid runaway crawls on links where there will always be a “Next” button. The article suggests hand-selecting the links you want to use. The second tip is to specify your snippets. The article explains,

“When MaxxCAT returns search results, each result comes with four pieces of information: url, title, meta, and snippet (a preview of some of the text found at the link). By default, MaxxCAT formulates a snippet by parsing the document, extracting content, and assembling a snippet out of that content. This works well for binary documents… but for webpages you wanted to trim out the content that is repeated on every page (e.g. navigation…) so search results are as accurate as possible.”

The third suggestion is to implement meta-tag filtering. Each suggestion is followed up with step-by-step instructions. These handy tips come from a partnership between SiteCrafting, a web design company founded in 1995 by Brian Forth, and MaxxCAT, a company acknowledged for its achievements in high performance search since 2007.
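The advice to trim repeated page furniture before building snippets can be sketched in a few lines. The helper below is a hypothetical illustration, not MaxxCAT's actual snippet logic, and real pages would call for a proper HTML parser rather than regular expressions:

```python
import re

def clean_snippet(html, length=150):
    # Drop navigation blocks before extracting a snippet so repeated
    # boilerplate (menus, footers) does not pollute search results.
    html = re.sub(r"<nav\b.*?</nav>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", html)      # strip remaining tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text[:length]

page = ("<nav>Home | About | Contact</nav>"
        "<p>MaxxCAT returns four pieces of information per result.</p>")
print(clean_snippet(page))
```

Without the first substitution, every snippet on the site would begin with "Home | About | Contact", which is exactly the repeated content the article says to trim.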

Chelsea Kerwin, September 2, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
