Alphabet Google Search: Dominance Ending?

February 28, 2016

I read “Will PageRank Expiration Threaten Google’s Dominance.” The main point is a question: Will Google’s shift to artificial intelligence “hurt Google Search’s market share and its stock price?”

The write up references the 1997 paper about the search engine’s core algorithms. (There is no reference to the work by Jon Kleinberg and the Clever system, which is understandable I suppose.) Few want to view Google as a me-too outfit, “cleverly” overlooking the firm’s emulation strategy. Think GoTo.com/Overture/Yahoo for the monetization mechanism.

The write up states:

The Google Search today transcends PageRank: Google has a myriad of proprietary technology.

I agree. Google is not an open source centric outfit. When was the last time Google made it easy to locate its employees’ technical papers, presentations at technical conferences, or details about products and services which just disappear? Orkut, anyone?

The write up shifts its focus to some governance issues; for example, Google’s Loon balloon, solving death, etc. There is a reference to Google’s strategy concerning mobile phones.

Stakeholders may want to worry because Google is dependent on search for the bulk of its revenues. I learned:

From Alphabet’s recent 10-k and Google’s Search revenues from Statista, you will realize that Search has been ~92%, ~90%, ~90% of total revenues in 2013-2015 respectively.

No big news here.

The core challenge for analysts will be to figure out if a shift to artificial intelligence methods for search will have unforeseen consequences. For example, maybe Google has figured out that indexing the Web costs too much. AI may be a way to reduce the costs of indexing and serving results. Google may realize that the shift from desktop queries to mobile queries threatens its ability to deliver information with the relevance users came to expect from the desktop experience.

Alphabet Google is at a crossroads. The desktop model from the late 1990s is less and less relevant in 2016. Like executives at any other company faced with change, Google’s leaders find themselves in the same boat as other online vendors. Today’s revenues may not be the revenues of tomorrow.

Will Alphabet Google face the information headwinds which buffeted Autonomy, Convera, Endeca, Fast Search & Transfer, and similar information access vendors? Is Google facing a long walk down the path which America Online and Yahoo followed? Will the one trick revenue pony die when it cannot adapt to the mobile jungle?

Good questions. Answers? Tough to find.

Stephen E Arnold, February 28, 2016

DuckDuckGo: Challenging Google Is Not a Bad Idea

February 25, 2016

I read “The Founder of DuckDuckGo Explains Why Challenging Google Isn’t Insane.” I noted several statements in the write up; namely:

  • DuckDuckGo delivers three billion searches a year, compared to Google’s trillion-plus searches per year. The zeros can be confusing to an addled goose like me. Let me say that Google is delivering more search results than DuckDuckGo.com.
  • DuckDuckGo’s revenues in 2015 were more than $1 million. Google’s revenues were about $75 billion. Yep, more zeros. (A back-of-envelope comparison appears after this list.)
  • It used to take Google six months to index pages on the Internet. (I thought that Google indexed from its early days based on a priority algorithm. Some sites were indexed in a snappy manner; others, like the Railroad Retirement Board, less snappily. I am probably dead wrong here, but it is a nifty point to underscore Google’s slow indexing. I just don’t think it was or is true.)
  • DuckDuckGo was launched in 2008. The company is almost eight years old.
  • Google’s incognito mode is a myth. What about those Google cookies? (I think the incognito mode nukes those long lived goodies.)
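The zeros are easier to keep straight with a bit of arithmetic. The snippet below just replays the figures cited in the write up; nothing here comes from DuckDuckGo or Google directly.

```python
# Back-of-envelope comparison using the figures quoted in the write up.
ddg_searches = 3e9      # DuckDuckGo: about three billion searches a year
goog_searches = 1e12    # Google: a trillion-plus searches a year (low end)
ddg_revenue = 1e6       # DuckDuckGo 2015 revenue: more than $1 million
goog_revenue = 75e9     # Google revenue: about $75 billion

print(f"Search volume: roughly {goog_searches / ddg_searches:,.0f} to 1")
print(f"Revenue: roughly {goog_revenue / ddg_revenue:,.0f} to 1")
# Prints roughly 333 to 1 on searches and 75,000 to 1 on revenue.
```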

Here’s the passage I highlighted:

Adams (the interviewer): I thought the government could track me whether I use DuckDuckGo or not.

Weinberg (the founder of DuckDuckGo): No they can’t. They can get to your Google searches, but if you use DuckDuckGo it’s completely encrypted between you and us. We don’t store anything. So there’s no data to get. The government can’t subpoena us for records because we don’t have records.

DuckDuckGo beats the privacy drum. That’s okay, but the privacy of Tor and I2P can be called into question. Is it possible that there are systems and methods to track user queries with or without the assistance of the search engine system? My hunch is that there are some interesting avenues to explore from companies providing tools to various government agencies. What about RACs, malware, metadata analyses, etc.? Probably I am wrong again. RATs. I have no immunity from my flawed information. I may have to grab my swim fins and go fin-fishing. I could also join a hacking team and vupen it up.

Stephen E Arnold, February 25, 2016

Data Insight: Common Sense Makes Sense

February 25, 2016

I am skeptical about lists of problems which hot buzzwords leave in their wake. I read “Why Data Insight Remains Elusive,” which I thought was another content marketing pitch to buy, buy, buy. Not so. The write up contains some clearly expressed, common sense reminders for those who want to crunch big data and point and click their way through canned reports. Those who actually took the second semester of Statistics 101 know that ignoring data quality and the nitty gritty of the textbook procedures can lead to bonehead outputs.

The write up identifies some points to keep in mind, regardless of which analytics vendor system a person is using to make more informed or “augmented” decisions.

Here’s the pick of the litter:

  1. Manage the data. Yep, time consuming, annoying, and essential. Skip this step at your decision making peril.
  2. Manage the indexing. The buzzword is metadata, but assigning keywords and other indexing items makes the difference when trying to figure out who, what, why, when, and where. Time consuming? Yep, and metadata is something not even the Alphabet Google thing does particularly well.
  3. Create data models. Do the textbook stuff. Get the model wrong, and what happens? Failure on a scale equivalent to fumbling the data management processes.
  4. Visualization is not analytics. Visualization makes outputs of numerical recipes appear in graphical form. Do not confuse Hollywood outputs with relevance, accuracy, or math on point to the problem one is trying to resolve.
  5. Knee jerking one’s way through analytics. Sorry, reflexes are okay but useless without context. Yep: have a problem, get the data, get the model, test, and examine the outputs. (A minimal sketch of this workflow appears after the list.)
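For those who like to see the floorboards before the retrofit, here is a minimal sketch of the workflow the list implies. It is my own illustration, not anything from the write up; the input file and column names are made up.

```python
# Minimal sketch of the list above: manage the data, build a textbook model,
# and examine the outputs instead of admiring a dashboard. Names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("events.csv")  # hypothetical input file

# 1. Manage the data: drop duplicates and rows missing the fields we model on.
df = df.drop_duplicates().dropna(subset=["age", "visits", "converted"])

# 3. Create a data model the textbook way, with a holdout set.
X_train, X_test, y_train, y_test = train_test_split(
    df[["age", "visits"]], df["converted"], test_size=0.2, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

# 5. Examine the outputs rather than knee jerking from a pretty chart.
print("Holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```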

Common sense. Most of this basic stuff was in the textbooks for one’s college courses. Too bad more folks did not internalize those floorboards; now they seek contractors to do a retrofit. Quite an insight when the bill arrives.

Stephen E Arnold, February 25, 2016

Social Media Search: Will Informed People Respond?

January 19, 2016

I recall asking for directions recently. There were three young people standing outside a bookstore. I wanted to know where the closest ice cream shop was. The three looked at me, smiled, looked at one another, smiled, and one of them said: “No clue.”

I like the idea of asking a group of people for information, but the experiences I have suggest that one has to be careful. Ask a tough question and no one may know the answer. Ask a question in an unfamiliar way such as “shop” instead of Dairy Queen, and the group may not have the faintest idea what one is talking about.

These thoughts influenced my reading of “Social Media: The Next Best Search Engine.” The title seemed to suggest that I could rely on my old school tricks but I would be silly not to use Facebook and Twitter to get information. That’s okay, but I don’t use Facebook, and the Twitter tweet thing seems to be down.

Bummer.

The write up reports:

Many consumers skip right over Google or Yahoo when conducting a search, and instead type it into social media networks.

The approach may work for peak TV and Miley Cyrus news, but I find analysis of social media intercept data more helpful for some of my queries.

Here’s the trick, according to the article:

To make sure you are responding to this growing trend, be present on social media on the channels that best make sense for your company. …The best way to optimize your posts is through hashtags and the content itself. For Facebook, Twitter, Google+ and Instagram, be sure to include relevant hashtags in your posts so that users can find your posts. For sites such as LinkedIn and Yelp which don’t utilize hashtags, make sure that you fill out your profiles as completely as possible.

Okay, indexing and details.

Search? I don’t think I will change my methods.

Stephen E Arnold, January 19, 2016

Search Is Marketing and Lots of Other Stuff Like Semantics

January 12, 2016

I spoke with a person who asked me, “Have you seen the 2013 Dave Amerland video?” The video in question is “Google Semantic Search and its Impact on Business.”

I hadn’t. I watched the five-minute video and formed some impressions / opinions about the information presented. Now I wish I had not invested five minutes in serial content processing.

First, the premise that search is marketing does not match up with my view of search. In short, search is more than marketing, although some view search as essential to making a sale.

Second, the video generates buzzwords. There’s knowledge graph, semantic, reputation, Big Data, and more. If one accepts the premise that search is about sales, I am not sure what these buzzwords contribute. The message is that when a user looks for something, the system should display a message that causes a sale. Objectivity does not have much to do with this, nor do buzzwords.

Third, the presentation of the information was difficult for me to follow. My attention was undermined by the wild and wonderful assertions about the buzzwords. I struggled with “from strings to things, from Web sites to people.” What?

The video is ostensibly about the use of “semantics” in content. I am okay with semantic processes. I understand that keeping words and metaphors consistent is helpful to a human and to a Web indexing system.

But the premise. I have a tough time buying in. I want search to return high value, on point content. I want those who create content to include helpful information, details about sources, and markers that make it possible for a reader to figure out what’s sort of accurate and what’s opinion.

I fear that the semantics practiced in this video shriek, “Hire me.” I also note that the video is a commercial for a book which presumably amplifies the viewpoint expressed in the video. That means the video vocalizes, “Buy my book.”

Heck, I am happy if I can get an on point result set when I run a query. No shrieking. No vocalization. No buzzwords. Will objective search be possible?

Stephen E Arnold, January 12, 2016

Dark Web: How Big Is It?

January 11, 2016

I read “Big Data and the Deep, Dark Web.” The write up raises an important point. I question the data, however.

First, there is the unpleasant task of dealing with terminology. A number of different phrases appear in the write up; for example:

  • Dark Web
  • Deep Web
  • Surface Web
  • World Wide Web

Getting hard data about the “number” of Web pages or Web sites is an interesting problem. I know that popular content gets indexed frequently. That makes sense in an ad-driven business model. I know that less frequently indexed content often is an unhappy consequence of resource availability. It takes time and money to index every possible link on each index cycle. I know that network latency can cause an indexing system to move on to another, more responsive site. Then there is bad code and intentional obfuscation, such as my posting content on Xenky.com for those who attend my LEA/Intelligence lectures sponsored by Telestrategies in information friendly Virginia.
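The resource and latency trade-offs are easy to sketch. The toy snippet below is illustrative only; no search vendor has shown me its crawl scheduler, and the priority scores and timeout are assumptions on my part.

```python
# Toy priority crawl queue with a latency budget. Popular pages get fetched
# first; slow or unreachable hosts are skipped until the next cycle.
import heapq
import urllib.request
from urllib.error import URLError

# (priority, url): lower numbers are crawled first; the scores here are made up.
frontier = [(0.1, "https://example.com/popular"),
            (0.9, "https://example.com/obscure")]
heapq.heapify(frontier)

TIMEOUT_SECONDS = 5  # assumed latency budget per fetch

while frontier:
    priority, url = heapq.heappop(frontier)
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as response:
            print(f"Indexed {url} ({len(response.read())} bytes)")
    except URLError:
        # Too slow, broken, or blocked: move on rather than burn the crawl budget.
        print(f"Skipped {url}; try again next cycle")
```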

Then there is the question of the Surface Web, which I call the Clear Web. It allows access to a Wall Street Journal article when I click a link from one site but not from another. The Wall Street Journal requires a user name and password, sometimes. So what is this? A Clear Web site, or a visible but not accessible site?

The terminology is messy.

Bright Planet coined the Deep Web moniker more than a decade ago. The usage was precise: These are sites which are not static; for example, dynamically generated Web pages. An example would be the Southwest Airlines fare page. A user has to click in order to get the pricing options. Bright Planet also included password protected sites. Examples range from a company’s Web page for employees to sites which require the user to pay money to gain access.

Then we have the semi exciting Dark Web, which can also be referenced as the Hidden Web.

Most folks writing about the number of Web sites or Web pages available in one of these collections are pretty much making up data.

Here’s an example of fanciful numerics. Note the disclaimers, which are a flashing yellow caution light for me:

Accurately determining the size of the deep web or the dark web is all but impossible. In 2001, it was estimated that the deep web contained 7,500 terabytes of information. The surface web, by comparison, contained only 19 terabytes of content at the time. What we do know is that the deep web has between 400 and 550 times more public information than the surface web. More than 200,000 deep web sites currently exist. Together, the 60 largest deep web sites contain around 750 terabytes of data, surpassing the size of the entire surface web by 40 times. Compared with the few billion individual documents on the surface web, 550 billion individual documents can be found on the deep web. A total of 95 percent of the deep web is publically accessible, meaning no fees or subscriptions.
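A quick pass with a calculator shows why the caution light flashes. The check below simply replays the quoted figures; none of the numbers are mine.

```python
# Replaying the 2001-era estimates quoted above to see how they hang together.
deep_web_tb = 7500     # "the deep web contained 7,500 terabytes"
surface_web_tb = 19    # "the surface web ... contained only 19 terabytes"
top_60_deep_tb = 750   # "the 60 largest deep web sites contain around 750 terabytes"

print(deep_web_tb / surface_web_tb)     # about 395x, versus the claimed "400 and 550 times"
print(top_60_deep_tb / surface_web_tb)  # about 39x, the quote's "40 times", from just 60 sites
```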

Where do these numbers come from? How many sites require Tor to access their data? I am working on my January Webinar for Telestrategies. Sorry. Attendance is limited to those active in LEA/Intelligence/Security. I queried one of the firms actively monitoring and indexing Dark Web content. That company, which you may want to pay attention to, is Terbium Labs. Visit them at www.terbiumlabs.com. Like most of the outfits involved in Dark Web analytics, certain information is not available. I was able to get some ball park figures from one of the founders. (He is pretty good with counting since he is a sci-tech type with industrial strength credentials in the math oriented world of advanced physics.)

Here’s the information I obtained which comes from Terbium Labs’s real time monitoring of the Dark Web:

We [Terbium Labs] probably have the most complete picture of it [the Dark Web] compared to most anyone out there.  While we don’t comment publicly on our specific coverage, in our estimation, the Dark Web, as we loosely define it, consists of a few tens of thousands or hundreds of thousands of domains, including light web paste sites and carding forums, Tor hidden services, i2p sites, and others.  While the Dark Web is large enough that it is impossible to comprehensively consume by human analysts, compared with the billion or so light web domains, it is relatively compact.

My take is that the Dark Web is easy to talk about. It is more difficult to obtain informed analysis of the Dark Web: what is available, which sites are operated by law enforcement and government agencies, and which sites are engaged actively in Dark Web commerce, information exchange, publishing, and other tasks.

One final point: The Dark Web uses Web protocols. In a sense, the Dark Web is little more than a suburb of the metropolis that Google indexes selectively. For more information about the Dark Web and its realities, check out my forthcoming Dark Web Notebook. If you want to reserve a copy, email benkent2020 at yahoo dot com. LEA, intel, and security professionals get a discount. Others pay $200 per copy.

Stephen E Arnold, January 11, 2016

Google and Students: The Quest for Revenue

January 7, 2016

The Alphabet Google thing is getting more focused in its quest for revenue in the post desktop search world. I read “Google Is Tracking Students As It Sells More Products to Schools, Privacy Advocates Warn.” I remember the good old days when the Google was visiting universities to chat about its indexing of the institutions’ Web sites and the presentations related to the book scanning project. This write up seems, if Jeff Bezos’ newspaper is spot on, to suggest that the Alphabet Google thing is getting more interested in students, not just the institutions.

I read:

More than half of K-12 laptops or tablets purchased by U.S. schools in the third quarter were Chromebooks, cheap laptops that run Google software…. But Google is also tracking what those students are doing on its services and using some of that information to sell targeted ads, according to a complaint filed with federal officials by a leading privacy advocacy group.

The write up points out:

In just a few short years, Google has become a dominant force as a provider of education technology…. Google’s fast rise has partly been because of low costs: Chromebooks can often be bought in the $100 to $200 range, a fraction of the price for a MacBook. And its software is free to schools.

Low prices. Well, Amazon is into that type of marketing too, right? Collecting data. Isn’t Amazon gathering data for its recommendations service?

My reaction to the write up is that the newspaper will have more revelations about the Alphabet Google thing. The security and privacy issue is one that has the potential to create some excitement in the land of online giants.

Stephen E Arnold, January 7, 2016

Did Apple Buy Topsy for an Edge over Google?

January 7, 2016

A couple years ago, Apple bought Topsy Labs, a social analytics firm and Twitter partner out of San Francisco. Now, in “Apple Inc. Acquired Topsy to Beat Google Search Capabilities,” BidnessEtc reports on revelations from Topsy’s former director of business development, Aaron Hayes-Roth. Writer Martin Blanc reveals:

“The startup’s tools were considered to be fast and reliable by the customers who used them. The in-depth analysis was smart enough to go back to 2006 and provide users with analytics and data for future forecasts. Mr. Roth and his team always had a curiosity attached to how Apple would use Twitter in its ecosystem. Apple does not make use of Twitter that much; the account was made in 2011 and there aren’t many tweets that come out of the social network. However, Mr. Roth explains that it was not Twitter data that Apple had its eye on; it was the technology that powered it. The architecture of Topsy makes it easier for systems to search large amounts of data extremely fast with impressive indexing capabilities. Subsequently, Apple’s ecosystem has developed quite a lot since Siri was first introduced with the iPhone 4s. The digital assistant and the Spotlight search are testament to how far Apple’s search capabilities have come.”

The article goes on to illustrate some of those advances, then points out the ongoing rivalry between Apple and Google. Are these improvements the result of Topsy’s tech? And will they give Apple the edge they need over their adversary? Stay tuned.


Cynthia Murrell, January 7, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

IBM Generates Text Mining Work Flow Diagram

January 4, 2016

I read “Deriving Insight Text Mining and Machine Learning.” This is an article with a specific IBM Web address. The diagram is interesting because it does not explain which steps are automated, which require humans, and which are one of those expensive man-machine processes. When I read about any text related function available from IBM, I think about Watson. You know, IBM’s smart software.

Here’s the diagram:

[Diagram: IBM’s text mining and machine learning work flow]

If you find this hard to read, you are not in step with modern design elements. Millennials, I presume, love these faded colors.

Here’s the passage I noted about the important step of “attribute selection.” I interpret attribute selection to mean indexing, entity extraction, and related operations. Because neither human subject matter specialists nor smart software performs this function particularly well, I highlighted the passage in red ink in recognition of IBM’s 14 consecutive quarters of financial underperformance:

Machine learning is closely related to and often overlaps with computational statistics—a discipline that also specializes in prediction-making. It has strong ties to mathematical optimization, which delivers methods, theory and application domains to the field. It is employed in a range of computing tasks where designing and programming explicit algorithms is infeasible. Example applications include spam filtering, optical character recognition (OCR), search engines and computer vision. Text mining takes advantage of machine learning specifically in determining features, reducing dimensionality and removing irrelevant attributes. For example, text mining uses machine learning on sentiment analysis, which is widely applied to reviews and social media for a variety of applications ranging from marketing to customer service. It aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. The attitude may be his or her judgment or evaluation, affective state or the intended emotional communication. Machine learning algorithms in text mining include decision tree learning, association rule learning, artificial neural learning, inductive logic programming, support vector machines, Bayesian networks, genetic algorithms and sparse dictionary learning.
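To be clear, the snippet below is not IBM’s code. It is a generic sketch of the “attribute selection” and sentiment classification steps the passage describes, using scikit-learn and a made-up four document training set.

```python
# Generic illustration of attribute (feature) selection plus a sentiment
# classifier. Not IBM's implementation; the tiny training set is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["great support, fast shipping", "terrible service, never again",
        "love this product", "broken on arrival, very disappointed"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

pipeline = make_pipeline(
    TfidfVectorizer(),        # turn text into weighted term features
    SelectKBest(chi2, k=5),   # attribute selection: keep the most informative terms
    MultinomialNB(),          # a simple Bayesian learner from the algorithm list
)
pipeline.fit(docs, labels)
print(pipeline.predict(["slow shipping and poor service"]))  # 0 or 1 for the new text
```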

Interesting, but how does this IBM stuff actually work? Who uses it? What’s the payoff from these use cases?

More questions than answers to explain the hard to read diagram, which looks quite a bit like a 1998 Autonomy graphic. I recall being able to read the Autonomy image, however.

Stephen E Arnold, December 30, 2015

Weekly Watson: In the Real World

January 2, 2016

I want to start off the New Year with a look at Watson in the real world. My real world is circumscribed by abandoned coal mines and hollows in rural Kentucky. I am pretty sure this real world is not the real world assumed in “IBM Watson: AI for the Real World.” IBM has tapped Bob Dylan, a TV game show, and odd duck quasi chemical symbols to communicate the importance of search and content processing.

The write up takes a different approach. In fact, the article begins with an interesting comment:

Computers are stupid.

There you go. A snazzy one liner.

The reminder that a man made device is not quite the same as one’s faithful boxer dog or the next door neighbor’s teen is startling.

The article summarizes an interview with a Watson wizard, Steven Abrams, director of technology for the Watson Ecosystem. This is one of those PR inspired outputs which I quite enjoy.

The write up quotes Abrams as saying:

“You debug Watson’s system by asking, ‘Did we give it the right data?'” Abrams said. “Is the data and experience complete enough?”

Okay, but isn’t this Dr. Mike Lynch’s approach? Lynch, as you may recall, was the Cambridge University wizard who was among the first to commercialize “learning” systems in the 1990s.

According to the write up:

Developers will have data sets they can “feed” Watson through one of over 30 APIs. Some of them are based on XML or JSON. Developers familiar with those formats will know how to interact with Watson, he [Abrams] explained.

As those who have used the 25 year old Autonomy IDOL system know, preparing the training data takes a bit of effort. Then as current content is fed into the Autonomy IDOL system, the humans have to keep an eye on the indexing. Ignore the system too long, and the indexing “drifts”; that is, the learned content is not in tune with the current content processed by the system. Sure, algorithms attempt to keep the calibrations precise, but there is that annoying and inevitable “drift.”
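The drift problem is easy to illustrate. Here is a toy check of my own devising, not anything from IBM or Autonomy: it compares the vocabulary of the training corpus with the vocabulary of newly processed content and flags divergence.

```python
# Toy drift check: compare term frequencies in the training corpus against
# newly indexed content. The threshold is arbitrary; this is not vendor code.
import math
from collections import Counter

def term_distribution(docs):
    counts = Counter(word for doc in docs for word in doc.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def divergence(p, q, epsilon=1e-9):
    """Crude KL-style divergence of the new distribution q from the training distribution p."""
    terms = set(p) | set(q)
    return sum(q.get(t, epsilon) * math.log(q.get(t, epsilon) / p.get(t, epsilon))
               for t in terms)

training_docs = ["enterprise search relevance", "index the enterprise content"]
new_docs = ["mobile app push notifications", "mobile push alerts"]

score = divergence(term_distribution(training_docs), term_distribution(new_docs))
print(f"Drift score: {score:.2f}")  # a rising score suggests recalibration is overdue
if score > 1.0:                     # arbitrary threshold for this toy example
    print("Vocabulary drift detected: time to retrain or re-tune.")
```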

IBM’s system, which strikes me as a modification of the Autonomy IDOL approach with a touch of Palantir analytics stirred in, is likely to be one expensive puppy to groom for the dog show ring.

The article profiles the efforts of a couple of IBM “partners” to make Watson useful for the “real” world. But the snip I circled in IBM red-ink red was this one:

But Watson should not be mistaken for HAL. “Watson will not initiate conduct on its own,” IBM’s Abrams pointed out. “Watson does not have ambition. It has no objective to respond outside a query.” “With no individual initiative, it has no way of going out of control,” he continued. “Watson has a plug,” he quipped. It can be disconnected. “Watson is not going to be applied without individual judgment … The final decision in any Watson solution … will always be [made by] a human, being based on information they got from Watson.”

My hunch is that Watson will require considerable human attention. But it may perform best on a TV show or in a motion picture where post production can smooth out the rough edges.

Maybe entertainment is “real”, not the world of a Harrod’s Creek hollow.

Stephen E Arnold, January 2, 2016

