Index and Search: The Threat Intel Positioning

December 24, 2015

The Dark Web is out there. Not surprisingly, there are a number of companies indexing Dark Web content. One of these firms is Digital Shadows. I learned in “Cyber Threat Intelligence and the Market of One” that search and retrieval has a new suit of clothes. The write up states:

Cyber situational awareness shifts from only delivering generic threat intelligence that informs, to also delivering specific information to defend against adversaries launching targeted attacks against an organization or individual(s) within an organization. Cyber situational awareness brings together all the information that an organization possesses about itself such as its people, risk posture, attack surface, entire digital footprint and digital shadow (a subset of a digital footprint that consists of exposed personal, technical or organizational information that is often highly confidential, sensitive or proprietary). Information is gathered by examining millions of social sites, cloud-based file sharing sites and other points of compromise across a multi-lingual, global environment spanning the visible, dark and deep web.

The approach seems to echo the Palantir “platform” approach. Palantir, one must not forget, is a 2015 version of the Autonomy platform. The notion is that content is acquired, federated, and made useful via outputs and user friendly controls.

What’s interesting is that Digital Shadows indexes content and provides a search system to authorized users. Commercial access is available via a tie up in the UK.

My point is that search is alive and well. The positioning of search and retrieval is undergoing some fitting and tucking. There are new terms, new rationales for business cases (fear is workable today), and new players. Under the surface are crawlers, indexes, and search functions.
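
For the record, here is a minimal sketch of that plumbing: a tokenizer, an inverted index, and an AND-style query function. It is illustrative Python only and assumes nothing about how Digital Shadows or any other vendor actually builds its system.

```python
# Minimal sketch of the machinery under any "cyber situational awareness" service:
# a tokenizer, an inverted index, and a search function. Illustrative only.
from collections import defaultdict
import re

def tokenize(text):
    """Lowercase word tokens; real systems add language detection, stemming, etc."""
    return re.findall(r"[a-z0-9']+", text.lower())

class TinyIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of document ids
        self.documents = {}                # document id -> original text

    def add(self, doc_id, text):
        self.documents[doc_id] = text
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query term (AND semantics)."""
        sets = [self.postings.get(t, set()) for t in tokenize(query)]
        return set.intersection(*sets) if sets else set()

index = TinyIndex()
index.add("post-1", "Exposed credentials found on a paste site")
index.add("post-2", "New phishing kit targets retail brands")
print(index.search("exposed credentials"))   # {'post-1'}
```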

The death of search may be news to the new players like Digital Shadows, Palantir, and Recorded Future, among numerous other shape shifters.

Stephen E Arnold, December 24, 2015

Podcast Search Service

December 18, 2015

I read “Podcasting’s Search Problem Could be Solved by This Spanish Startup.” According to the write up:

Smab’s web app will automatically transcribe podcasts, giving listeners a way to scan and search their content.

What’s the method? I learned from the article:

The company takes audio files and generates text files. If those text files are hosted on Smab’s site, a person can click on a word in the transcript and it will take them directly to that part of the recording, because the transcript and the text are synced. In fact, a second program assesses the audio to determine where sentences begin, making it easier to find chunks of audio. Both functions are uneven, but it’s worth noting here that the company is in a very early stage.
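
The syncing trick is easy to picture. Below is a minimal, illustrative sketch, not Smab's code; it assumes the speech-to-text step emits word and start-time pairs, so a click on a transcript word becomes a simple lookup and seek.

```python
# Illustrative sketch of transcript-to-audio syncing, not Smab's implementation.
# Assumes the speech-to-text step emits (word, start_seconds) pairs.
from bisect import bisect_right

transcript = [
    ("welcome", 0.0), ("to", 0.4), ("the", 0.5), ("show", 0.7),
    ("today", 2.1), ("we", 2.6), ("discuss", 2.8), ("search", 3.3),
]

def offset_for_word(index):
    """Return the audio offset for the transcript word the listener clicked."""
    return transcript[index][1]

def word_at_time(seconds):
    """Inverse lookup: which word is being spoken at a given offset."""
    starts = [start for _, start in transcript]
    i = bisect_right(starts, seconds) - 1
    return transcript[max(i, 0)][0]

print(offset_for_word(7))    # 3.3 -> seek the player to 3.3 seconds
print(word_at_time(2.7))     # 'we'
```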

There are three challenges for automatic voice to text conversion and indexing of audio and video sources:

First, there is a great deal of content. The computational cost to convert a large chunk of audio data to a searchable form and then offer a reasonably robust search engine is significant.

Second, selectivity requires an editorial policy. Business and government are likely paying customers, but the topics these folks chase change frequently. The risk is that a paying customer will be disappointed and drop the service. Thus, sustainable revenue may be an issue.

Third, indexing podcasts is work that Apple handles rather offhandedly, and indexing YouTube content is something Google performs as part of its massive investment in search. The fact that neither of these firms has pushed forward with more sophisticated search systems suggests that market demand may not be significant.

I hope the Smab service becomes available. Worth watching.

Stephen E Arnold, December 18, 2015

Google Indexes Some Dynamic Content

December 10, 2015

If you generate Web pages dynamically (who doesn’t?), you may want to know if the Alphabet Google thing can index the content on dynamic pages.

For some apparently objective information about the GOOG’s ability to index dynamic content, navigate to “Does Google Crawl Dynamic Content?” The article considers 11 types of dynamic content methods.

Here’s the passage I highlighted:

  • Google crawls and indexes all content that was injected by javascript.
  • Google even shows results in the SERP that are based on asynchronously injected content.
  • Google can handle content from httpRequest().
  • However, JSON-LD as such does not necessarily lead to SERP results (as opposed to the officially supported SERP entities that are not only indexed, but also used to decorate the SERP).
  • Injected JSON-LD gets recognized by the structured data testing tool – including Tag Manager injection. This means that once Google decides to support the entities, indexing will not be a problem.
  • Dynamically updated meta elements get crawled and indexed, too.
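
If you want to check the JSON-LD point on your own pages, one low-tech approach is to pull the application/ld+json blocks out of the rendered markup (for example, HTML saved from a headless browser after scripts run) and confirm they parse. The sketch below is illustrative only and says nothing about what Google will do with the markup.

```python
# Illustrative check of JSON-LD blocks in rendered HTML; it says nothing about
# whether Google will surface the entities, only whether the markup parses.
import json
import re

def extract_json_ld(html):
    """Return every parsed application/ld+json block found in the markup."""
    pattern = re.compile(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        re.DOTALL | re.IGNORECASE,
    )
    blocks = []
    for raw in pattern.findall(html):
        try:
            blocks.append(json.loads(raw))
        except ValueError:
            blocks.append({"error": "malformed JSON-LD", "raw": raw[:80]})
    return blocks

sample = '''<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Demo"}
</script></head><body></body></html>'''

print(extract_json_ld(sample))
```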

The question one may wish to consider is, “What does Alphabet Google do with that information and data?” There are some clues in the Ramanathan Guha patent documents filed in 2007.

Stephen E Arnold, December 10, 2015

European ECommerce Search Vendors

December 7, 2015

I read “Suchfunktion: Mehr Treffer – mehr Umsatz” (“Search Function: More Hits, More Revenue”). If you read German, you will learn about several eCommerce search solutions. These are:

  • Epoq Search
  • Exorbyte
  • Fact Finder
  • Findologic
  • SDL Fredhopper
  • Searchperience

Epoq Search, according to the firm’s Web site, delivers error tolerant eCommerce search.

Exorbyte is an eCommerce search system which can also handle some enterprise search tasks.

Fact Finder, the best German search engine, according to the company’s Web site, delivers a new backend experience. You can learn more about this firm’s approach to eCommerce search at this link.

Findologic wants to have customers stop searching and find. The system’s features are described briefly at this link.

SDL Fredhopper. I have always liked the name Fredhopper. The system is now SDL eCommerce Optimization. Farewell, Fredhopper. You can learn about the system, which is about 20 years old, at this link. SDL is the translation outfit.

Searchperience is a cloud and eCommerce search system. The system does “professional indexing.” More information is available at this link.

Why did I provide links? The reason is that the source article did not include links. The descriptions of the systems are helpful, but the value of the write up pivots on its coverage of companies not mentioned in the US-centric write ups about search.

Stephen E Arnold, December 7, 2015

Advances to Google Search for Mobile

December 7, 2015

Google Search plans a couple of changes to the way it works on our mobile devices. TechCrunch tells us, “Google Search Now Surfaces App-Only Content, Streams Apps from the Cloud When Not Installed on Your Phone.” We are reminded that Google has been indexing apps for a couple of years now, as a hedge against losing ground as computing shifts away from the desktop. Now, apps that do not mirror their content on the web can be indexed. Writer Sarah Perez explains:

“To make this possible, developers only have to implement Google’s app indexing API, as before, which helps Google to understand what a page is about and how often it’s used. It has also scaled its ranking algorithm to incorporate app content. (Google had actually been ranking in-app content prior to this, but now it no longer requires apps to have related websites.)”

Also, mobile users will reportedly be able to stream apps from the cloud if they do not have them installed. Though convenient for the rest of us, this advance could pose a problem for app developers; Perez observes:

“After all, if their app’s most valuable information is just a Google search away, what motive would a user have to actually install their app on their phone? Users would have to decide if they plan on using the app frequently enough that having the native counterpart would be an advantage. Or the app would have to offer something Google couldn’t provide, like offline access to content perhaps.”

It will be interesting to see what gimmicks developers come up with to keep the downloads going. The tech behind this service came from startup Agawi, which Google acquired in 2014. The streaming option is not expected to be released far and wide right away, however; apparently Google views it as an “experiment,” and wants to see how it is received before offering it worldwide. They couldn’t be concerned about developer backlash, could they?

Cynthia Murrell, December 7, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph


Has Enterprise Search Drowned in a Data Lake?

December 6, 2015

I had a phone conversation with a person who unluckily resides in a suburb of New York City. Apparently the long commute allows my contact to think big thoughts and formulate even bigger questions. He asked me, “What’s going to happen to enterprise search?”

I thought this was a C minus question, but New Yorkers only formulate A plus questions. I changed the subject to the new Chick-fil-A on Sixth. After the call, I jotted down some thoughts about enterprise search.

Here for your contemplation are five of my comments, which consumed three legal pad sheets. I also write small.

Enterprise Search Is Week Old Celery

In the late 1990s when the Verity hype machine was rolling and the Alphabet boys were formulating big thoughts about search, enterprise search was the hot ticket. For some techno cravers, enterprise search was the Alpha and Omega. If information is digital, finding an item of information was the thrill ride ending in a fluffy pile of money. A few folks made some money, but the majority of the outfits jumping into search either sold out or ended up turning off the lights. Today, enterprise search is a utility and the best approach is to use an open source solution. There are some proprietary systems out there, but the appeal of open source is tough to resist. Remember. Search is a utility, not a game changer for many organizations. Good enough tramples over precision, recall, and relevance.

New Buzzwords and the Same Old Tune

Hot companies today do not pound their electric guitars with the chords of findability. Take an outfit like Palantir. It is a search and information access outfit, but the company avoids the spotlight and positions its technology packages as super stealthy magic insight machines. Palantir likes analytics, visualizations, and similar next generation streamlined tangerine colored outputs. Many of the companies profiled in my monograph CyberOSINT are, at their core, search systems. But “search” is tucked into a corner, and the amplified functions like fancy math, real time processing, and smart software dominate. From my point of view, these systems are search repackaged and enhanced for today’s procurement professionals. That’s okay. But search is still search no matter what the “visionaries” suggest. Many systems are enterprise search wrapped in new sheet music. The notes are the same.

Big Data

I find the Big Data theme interesting. The idea of processing petabytes of data in a blink is future forward. The problem is that the way statistical procedures operate is to sidestep analyzing every single item. I can examine a grocery list of 10 items, but I struggle when presented with a real time updating of that list with trillions of data points a second. The reality of Big Data is that it has been around. A monk faced with copying two books in a couple of days has an intractable Big Data problem. The love of Hadoop and its extended family of data management tools does not bring the black sheep of the information family into the party room. Big Data requires pesky folks who have degrees in statistics or who have spent their youth absorbed in Mathematica, MatLab, SPSS, or SAS. Bummer. Enterprise search systems can choke on modest data. Big Data kills some systems dead like a wasp sprayed with Raid.
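
To make the sampling point concrete, here is a toy comparison between touching every record and estimating from a one percent sample. The numbers are invented; the point is that most "Big Data analytics" lean on the estimate, not the full scan.

```python
# Toy illustration of the point above: statistical procedures estimate from a
# sample rather than touching every record. Numbers here are made up.
import random

random.seed(42)
purchases = [random.uniform(1, 500) for _ in range(1_000_000)]  # pretend "big" data

exact_mean = sum(purchases) / len(purchases)            # touches every record

sample = random.sample(purchases, 10_000)               # touches 1% of them
estimated_mean = sum(sample) / len(sample)

print(f"exact:     {exact_mean:.2f}")
print(f"estimated: {estimated_mean:.2f}")   # close enough for a dashboard,
                                            # not the same as reading everything
```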

Real Time

For a client in the UK, I had to dig into the notion of real time. Guess what the goslings found. There was not one type of real time information system. I believe there were seven distinct types of real time information. Each type has separate cost and complexity challenges. The most expensive systems were the ones charged with processing financial transactions in milliseconds. Real time for a Web site might mean anything from every 10 seconds to every week or so. Real time is tough because no matter what technologies are used to speed up computer activities, the dark shadow of latency falls. When data arrive which are malformed, the real time system returns incomplete outputs. Yikes. Incomplete? Yep, missing info. Real time is easy to say, but tough to deliver at a price an average Fortune 1000 company or an essential US or UK government agency can afford. Speed means lots of money. Enterprise search systems usually struggle with the real time thing.

Automatic, Smart Indexing, Outputs, Whatever

I know the artificial intelligence, cognitive approach to information is a mini megatrend. Unfortunately, when folks look closely at these systems, there remains a need for slug like humans to maintain dictionaries, inspect outputs, tune algorithms, and “add value” when a pesky third party cooks up a new term, phrase, or code. Talk about smart software does not implement useful smart software. The idea is as appealing today as it was when Fulcrum in Ottawa pitched its indexing approach or when iPhrase talked about its smart system. I am okay with talk as long as the speakers acknowledge the perpetual upkeep and include in the budget the humans who have to keep these Rube Goldberg confections on point. Humans are not very good indexers. Automated indexing systems are not very good indexers. The idea is, of course, that good enough is good enough. Sorry. Work remains for the programmers. The marketers just talk about the magic of smart systems. Licensees expect the systems to work, which is an annoying characteristic of some licensees and users.
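
A toy example of why the humans never go away: a dictionary driven tagger is only as current as its dictionary, so a term coined last week is invisible until someone adds it. The vocabulary and document below are made up.

```python
# Toy dictionary-based tagger. It illustrates the maintenance problem: terms
# not in the controlled vocabulary are silently missed until a human adds them.
CONTROLLED_VOCABULARY = {
    "phishing": "Security/Fraud",
    "hadoop": "Infrastructure/Big Data",
    "xquery": "Languages/Query",
}

def tag(text):
    found = {}
    for term, label in CONTROLLED_VOCABULARY.items():
        if term in text.lower():
            found[term] = label
    return found

doc = "The smishing campaign abused a Hadoop cluster misconfiguration."
print(tag(doc))   # {'hadoop': ...}; 'smishing' is missed until a human adds it

# The perpetual upkeep the marketers skip over:
CONTROLLED_VOCABULARY["smishing"] = "Security/Fraud"
print(tag(doc))   # now both terms are tagged
```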

Wrap Up

Poor enterprise search. Relegated to utility status. Wrapped up in marketing salami. Celebrated by marketers who want to binge watch Parks and Recreation.

Enterprise search. You are still around, just demoted. The future? Good enough. Invest in hyper marketing and seek markets which do not have a firm grasp of search and retrieval. Soldier on. There are many streaming videos to watch if you hit the right combination on the digital slot machine.

Stephen E Arnold, December 6, 2015

XML Marches On

December 2, 2015

For fans of XML and automated indexing, there’s a new duo in town. The shootout at the JSON corral is not scheduled, but you can get the pre showdown information in “Smartlogic and MarkLogic Corporation Enhance Platform Integration between Semaphore and MarkLogic Database.” Rumors of closer ties between the outfits surfaced earlier this year. I pinged one of the automated indexing company’s wizards and learned, “Nope, nothing going on.” Gee, I almost believed this until a Virtual Strategy story turned up. Virtual no more.

According to the write up:

Smartlogic, the Content Intelligence Company, today announced tighter software integration with MarkLogic, the Enterprise NoSQL database platform provider, creating a seamless approach to semantic information management where organizations maximize information to drive change. Smartlogic’s Content Intelligence capabilities provide a robust set of semantic tools which create intelligent metadata, enhancing the ability of the enterprise-grade MarkLogic database to power smarter applications.

For fans of user friendliness, the tie up may mean more XQuery scripting and some Semaphore tweaks. And JSON? Not germane.

What is germane is that Smartlogic may covet some of MarkLogic’s publishing licensees. After slicing and dicing, some of these outfits are having trouble finding out what their machine assisted editors have crafted with reduced quantities of editorial humans.

Stephen E Arnold, December 2, 2015

Inferences: Check Before You Assume the Outputs Are Accurate

November 23, 2015

Predictive software works really well as long as the software does not have to deal with horse races, the stock market, and the actions of a single person and his closest pals.

“Inferences from Backtest Results Are False Until Proven True” offers a useful reminder to those who want to depend on algorithms someone else set up. The notion is helpful when the data processed are unchecked, unfamiliar, or just assumed to be spot on.

The write up says:

the primary task of quantitative traders should be to prove specific backtest results worthless, rather than proving them useful.

What throws backtests off the track? The write up provides a useful list of reminders:

  1. Data-mining and data snooping bias
  2. Use of non tradable instruments
  3. Unrealistic accounting of frictional effects
  4. Use of the market close to enter positions instead of the more realistic open
  5. Use of dubious risk and money management methods
  6. Lack of effect on actual prices

The author is concerned about financial applications, but the advice may be helpful to those who just want to click a link, output a visualization, and assume the big spikes are really important to the decision they will influence in one hour.
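
Item 4 in the list above is easy to see in a toy backtest: enter each position at the prior close (a price the trader could not actually get) versus the next open, and the result shifts. The prices below are invented for illustration.

```python
# Toy illustration of item 4 above: backtesting an entry at the prior close
# versus the more realistic next-day open. All prices are invented.
days = [
    # (open, close) for a stock the toy "strategy" buys each day and sells at the close
    (100.0, 101.5), (101.8, 103.0), (103.5, 103.2), (103.0, 105.0),
]

def total_return(entry="open"):
    ret = 1.0
    prev_close = days[0][0]
    for open_px, close_px in days:
        entry_px = prev_close if entry == "close" else open_px
        ret *= close_px / entry_px
        prev_close = close_px
    return ret - 1.0

print(f"enter at prior close: {total_return('close'):+.2%}")  # a price you could not get
print(f"enter at the open:    {total_return('open'):+.2%}")   # more realistic
```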

One point I highlighted was:

Widely used strategies lose any edge they might have had in the past.

Degradation occurs just like the statistical drift in Bayesian based systems. Exciting if you make decisions on outputs known to be flawed. How are those automatic indexing, business intelligence, and predictive analytics systems working?

Stephen E Arnold, November 23, 2015

Life Is Perceived As Faster Says Science

November 20, 2015

I read a spider friendly, link baitable article in a UK newspaper. You have to love these folks because each page view downloads lots and lots of code, ads, trackers, etc.

The story was “Can’t Believe It’s Almost Christmas? Technology Is Speeding Up Our Perception of Time, Researchers Say.” Heck of a title in my opinion.

The main point is captured in this quote from Wizard McLoughlin:

… a long monologue from a ‘real’ book.

‘It’s almost as though we’re trying to emulate the technology and be speedier and more efficient. It seems like there’s something about technology itself that primes us to increase that pacemaker inside of us that measures the passing of time.’

The “it” I assume means the way the modern world works.

I think the idea is valid. A good example is the behavior of search and content processing companies. Although many companies evidence the behaviors I want to identify, these quirks are most evident among the search and content processing outfits which have ingested tens of millions in venture funding.

The time pressure encourages a thought process like this statement, which I recall from my reading of Samuel Johnson:

Nothing focuses the mind like a hanging.

The search and content processing vendors under the most pressure appear to be taking the following actions. These comments apply to Attivio, BA Insight, Coveo, and Lucidworks-type companies. The craziness of IBM Watson and HP Autonomy in the cloud may have other protein triggers.

Here we go:

  • Big data. How can outfits which struggle to update indexes and process new and changed content handle Big Data? It is trendier to call a company like Vivisimo a Big Data firm than to try to explain that key word search has “real” value.
  • Customer support. I don’t know about you, but I avoid customer support. Customer support means stupid telephone selections, dorky music, reminders that the call is being monitored for “quality purposes”, and other cost cutting, don’t-bother-us approaches. Where search and content processing fits in has little to do with customer service and everything to do with cost reduction.
  • Analytics. Yep, indexing systems can output a list of the number of times a word appears in a document, a batch, or a time period. These items can be counted and converted to a graph. But I do not think that enterprise search systems are analytics systems. Again. If it helps close a deal, go with it.
  • Business intelligence. I like this one. The idea that looking up the name of a person, place, or thing provides intelligence is laughable. I also get a kick out of selective dissemination functions or standing queries presented as a magical window on real time data. Baloney. Intelligence is not a variation of search and content processing. Search and content processing are utility functions within larger, more comprehensive systems. Check out NetReveal and let me know how close an enterprise search vendor comes to this BAE Systems’ service.

When will enterprise search and content processing vendors alter their marketing?

Not until their stakeholders are able to sell these outfits and move on to less crazy investments.

The craziness will persist because the time available to hit their numbers is dwindling. Fiddling with mobile devices and getting distracted by shiny bits just makes the silliness more likely.

Have you purchased a gift using Watson’s app? Have you added a Watson recipe to your holiday menu? Have you used a metasearch system like Vivisimo to solve your Big Data problems? Have you embraced Solr as a way to make Hadoop data repositories cornucopias of wisdom?

Right. The stuff may not work as one hopes. Time is running out. Quickly in real time and in imagined time.

Stephen E Arnold, November 20, 2015

An Early Computer-Assisted Concordance

November 17, 2015

An interesting post at Mashable, “1955: The Univac Bible,” takes us back in time to examine an innovative indexing project. Writer Chris Wild tells us about the preacher who realized that these newfangled “computers” might be able to help with a classically tedious and time-consuming task: compiling a book’s concordance, or alphabetical list of key words, their locations in the text, and the context in which each is used. Specifically, Rev. John Ellison and his team wanted to create the concordance for the recently completed Revised Standard Version of the Bible (also newfangled). Wild tells us how it was done:

“Five women spent five months transcribing the Bible’s approximately 800,000 words into binary code on magnetic tape. A second set of tapes was produced separately to weed out typing mistakes. It took Univac five hours to compare the two sets and ensure the accuracy of the transcription. The computer then spat out a list of all words, then a narrower list of key words. The biggest challenge was how to teach Univac to gather the right amount of context with each word. Bosgang spent 13 weeks composing the 1,800 instructions necessary to make it work. Once that was done, the concordance was alphabetized, and converted from binary code to readable type, producing a final 2,000-page book. All told, the computer shaved an estimated 23 years off the whole process.”
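
The same job is now a few lines of code. Here is a minimal, illustrative concordance builder (each word, its position, and a snippet of context) run against a scrap of sample text. The Univac team needed 1,800 hand composed instructions for the context gathering step alone.

```python
# Minimal concordance builder: for each word, record where it occurs and a
# snippet of surrounding context. Illustrative; the sample text is just a scrap.
from collections import defaultdict
import re

def build_concordance(text, context_words=3):
    tokens = re.findall(r"[A-Za-z']+", text)
    concordance = defaultdict(list)
    for i, token in enumerate(tokens):
        left = " ".join(tokens[max(i - context_words, 0):i])
        right = " ".join(tokens[i + 1:i + 1 + context_words])
        concordance[token.lower()].append(
            {"position": i, "context": f"{left} [{token}] {right}"}
        )
    return concordance

verse = "In the beginning God created the heaven and the earth"
for entry in build_concordance(verse)["the"]:
    print(entry)
```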

The article is worth checking out, both for more details on the project and for the historic photos. How much time would that job take now? It is good to remind ourselves that tagging and indexing data has only recently become a task that can be taken for granted.

Cynthia Murrell, November 17, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

