Search Sink Hole Identified and Allegedly Paved and Converted to a Data Convenience Store

May 20, 2016

I try to avoid reading more than one write up a day about alleged revolutions in content processing and information analytics. My addled goose brain cannot cope with the endlessly recycled algorithms dressed up in Project Runway finery.

I read “Ryft: Bringing High Performance Analytics to Every Enterprise,” and I was pleased to see a couple of statements which resonated with my dim view of information access systems. There is an accompanying video in the write up. I, as you may know, gentle reader, am not into video. I prefer reading, which is the old fashioned way to suck up useful factoids.

Here’s the first passage I highlighted:

Any search tool can match an exact query to structured data—but only after all of the data is indexed. What happens when there are variations? What if the data is unstructured and there’s no time for indexing? [Emphasis added]

The answer to the question is increasing costs for sales and marketing. The early warnings of amped-up baloney are the presentations given at conferences and pumped out via public relations firms. (No, Buffy, no, Trent, I am not interested in speaking with the visionary CEO who hired you.)
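For readers wondering what index-free fuzzy matching even looks like, here is a minimal sketch: a brute-force scan that scores approximate matches with a standard-library similarity measure. It is a conceptual illustration only, not Ryft’s hardware-accelerated approach, and it makes the trade-off plain: no indexing cost up front, but every query pays for a full scan.

# Minimal sketch: index-free fuzzy matching by brute-force scan.
# A conceptual illustration, not Ryft's implementation.
from difflib import SequenceMatcher

def fuzzy_scan(query, lines, threshold=0.7):
    """Score every line against the query; no index is built."""
    hits = []
    for line in lines:
        for token in line.split():
            score = SequenceMatcher(None, query.lower(), token.lower()).ratio()
            if score >= threshold:
                hits.append((score, line))
                break
    return sorted(hits, reverse=True)

# Example: the variant spelling "Smyth" still surfaces the "Smith" record.
records = ["John Smith, 42, Louisville", "Jane Doe, 37, Lexington"]
print(fuzzy_scan("Smyth", records))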

I also highlighted:

With the power to complete fuzzy search 600X faster at scale, Ryft has opened up tremendous new possibilities for data-driven advances in every industry.

I circled the 600X. Gentle reader, I struggle to comprehend a 600X increase in content processing speed. (A 600X speedup would turn a ten-minute scan into a one-second one.) Dear Mother Google has invested in creating a new chip to get around the limitations of our friend von Neumann’s approach to executing instructions. I am not sure Mother Google has this nailed because Mother Google, like IBM, announces innovations without much real world demonstration of the nifty “new” things.

I noted this statement too:

For the first time, you can conduct the most accurate fuzzy search and matching at the same speed as exact search without spending days or weeks indexing data.

Okay, this strikes me as a capability I would embrace if I could get over or around my skepticism. I was able to take a look at the “solution” which delivers the astounding performance and information access capability. Here’s an image from Ryft’s engineering professionals:

[Image: diagram of the Ryft solution]

Notice that we have Spark and pre-built components. I assume there are myriad other innovations at work.

The hitch in the git along is that in order to deal with certain real world information processing challenges, the inputs come from disparate systems, each generating substantial data flows in real time.

Here’s an example of a real world information access and understanding challenge, which, as far as I know, has not been solved in a cost effective, reliable, or usable manner.

[Image: unclassified diagram of airborne platforms and their data streams]

Image source: Plugfest 2016 Unclassified.

This unclassified illustration makes clear that the little things in the sky pump out lots of data into operational theaters. Each stream of data must be normalized and then converted to actionable intelligence.
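To make the normalization burden concrete, here is a minimal sketch of the glue code such a pipeline needs: two hypothetical sensor feeds with different field names, units, and timestamp formats mapped into one common record before any analytics can run. The field names and formats are invented for illustration; real feeds are messier.

# Minimal sketch: normalizing two hypothetical sensor feeds into one schema.
# Field names, units, and formats are invented for illustration only.
from datetime import datetime, timezone

def normalize_feed_a(msg):
    # Feed A: epoch seconds, position as decimal-degree fields
    return {
        "ts": datetime.fromtimestamp(msg["epoch"], tz=timezone.utc),
        "lat": msg["lat"],
        "lon": msg["lon"],
        "source": "feed_a",
    }

def normalize_feed_b(msg):
    # Feed B: ISO-8601 timestamp string, position packed as "lat,lon"
    lat, lon = (float(x) for x in msg["pos"].split(","))
    return {
        "ts": datetime.fromisoformat(msg["time"]),
        "lat": lat,
        "lon": lon,
        "source": "feed_b",
    }

stream = [
    normalize_feed_a({"epoch": 1463740800, "lat": 38.19, "lon": -85.68}),
    normalize_feed_b({"time": "2016-05-20T10:00:05+00:00", "pos": "38.20,-85.67"}),
]
# Only after this step can the records be merged, indexed, and analyzed together.
print(sorted(stream, key=lambda r: r["ts"]))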

The assertion about 600X sounds tempting, but my hunch is that the latency in normalizing, transferring, and processing will not meet the need for real time, actionable, accurate outputs when someone is shooting at a person with a hardened laptop in a threat environment.

In short, perhaps the spark will ignite a fire of performance. But I have my doubts. Hey, that’s why I spend my time in rural Kentucky where reasonable people shoot squirrels with high power surplus military equipment.

Stephen E Arnold, May 20, 2016

The Kardashians Rank Higher Than Yahoo

May 20, 2016

I avoid the Kardashians and other fame chasers because I have better things to do with my time.  I never figured that I would actually write about the Kardashians, but the phrase “never say never” comes into play.  Vanity Fair’s “Marissa Mayer vs. ‘Kim Kardashian’s Ass’: What Sunk Yahoo’s Media Ambitions?” tells a bleak story about the current happenings at Yahoo.

Yahoo has ended many of its services and let go fifteen percent of its staff, and there are very few journalists left on the team.  The remaining journalists’ worry is not producing golden content; it is competing with everything already on the Web, especially “Kim Kardashian’s ass,” as they say.

When Marissa Mayer took over Yahoo as CEO in 2012, she was determined to carve out Yahoo’s identity as a tech company.  Mayer, however, also wanted Yahoo to be a media powerhouse, so she hired many well-known journalists to run niche projects in popular areas from finance to beauty to politics.  It was not a successful move, and now Yahoo is tightening its belt one more time.  The Yahoo news algorithm did not mesh with the big name journalists; the hope was that their names would soar above popular content such as Kim Kardashian’s ass.  They did not.

Much of Yahoo’s remaining value comes from its Alibaba stake.  As the article observes:

“But the irony is that Mayer, a self-professed geek from Silicon Valley, threw so much of her reputation behind high-profile media figures and went with her gut, just like a 1980s magazine editor—when even magazine editors, including those who don’t profess to “get” technology, have long abandoned that practice themselves, in favor of what the geeks in Silicon Valley are doing.”

Mayer was trying to create a premier media company, but lower-quality content is more popular than top-of-the-line journalists.  The masses prefer junk food in their news.

 

Whitney Grace, May 20, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Big Data and Value

May 19, 2016

I read “The Real Lesson for Data Science That is Demonstrated by Palantir’s Struggles · Simply Statistics.” I love write ups that plunk the word statistics near simple.

Here’s the passage I highlighted in money green:

… What is the value of data analysis?, and secondarily, how do you communicate that value?

I want to step away from the Palantir Technologies’ example and consider a broader spectrum of outfits tossing around the jargon “big data,” “analytics,” and synonyms for smart software. One doesn’t communicate value. One finds a person who needs a solution and crafts the message to close the deal.

When a company and its perceived technology catch the attention of allegedly informed buyers, a bandwagon effect kicks in. Talk inside an organization leads to mentions in internal meetings. The vendor whose products and services are the subject of these comments begins to hint at bigger and better things at conferences. Then a real journalist may catch a scent of “something happening” and write an article. Technical talks at niche conferences generate wonky articles, usually without dates or footnotes, which make sense to someone without access to commercial databases. If a social media breeze whips up the smoldering interest, then a fire breaks out.

A start up has to be clever, lucky, or tactically gifted to pull off this type of wildfire. But when it happens, big money chases the outfit. Once money flows, the company and its products and services become real.

The problem with companies processing a range of data is that there are some friction inducing processes that are tough to coat with Teflon. These include:

  1. Taking different types of data, normalizing them, indexing them in a meaningful manner, and creating metadata which is accurate and timely.
  2. Converting numerical recipes, many with built-in threshold settings and chains of calculations, into marching band order able to produce recognizable outputs (a sketch of such a chain follows this list).
  3. Figuring out how to provide an infrastructure that can sort of keep pace with the flows of new data and the updates/corrections to the already processed data.
  4. Generating outputs that people in a hurry or in a hot zone can use to positive effect; for example, in a war zone, not getting killed because the visualization is not spot on.
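To make the second item concrete, here is a minimal sketch, with invented weights and cut-offs, of the kind of ordered, threshold-laden calculation chain that has to be reimplemented and tuned for every deployment:

# Minimal sketch: a chain of calculations with built-in thresholds that
# must execute in a fixed order to yield a recognizable output.
# Weights and cut-offs are invented for illustration only.

def score_document(doc):
    # Step 1: crude term-frequency signal
    tf = doc["hits"] / max(doc["terms"], 1)
    # Step 2: damp stale content
    freshness = 1.0 if doc["age_days"] <= 30 else 0.5
    # Step 3: combine, then apply a threshold baked into the recipe
    score = 0.7 * tf + 0.3 * freshness
    return "relevant" if score >= 0.6 else "noise"

print(score_document({"hits": 12, "terms": 15, "age_days": 10}))   # relevant
print(score_document({"hits": 1, "terms": 40, "age_days": 400}))   # noise

Change any weight or cut-off and the outputs shift, which is exactly why tuning these recipes consumes time and money.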

The write up focuses on a single company and its alleged problems. That’s okay, but it understates the problem. Most content processing companies run out of revenue steam. The reason is that the licensees or customers want the systems to work better, faster, and more cheaply than predecessor or incumbent systems.

The vast majority of search and content processing systems are flawed, expensive to set up and maintain, and really difficult to use in a way that produces high reliability outputs over time. I would suggest that the problem bedevils a number of companies.

Some of those struggling with these issues are big names. Others are much smaller firms. What’s interesting to me is that the trajectory content processing companies follow is a well worn path. One can read about Autonomy, Convera, Endeca, Fast Search & Transfer, Verity, and dozens of other outfits and discern what’s going to happen. Here’s a summary for those who don’t want to work through the case studies on my Xenky intel site:

Stage 1: Early struggles and wild and crazy efforts to get big name clients

Stage 2: Making promises that are difficult to implement but which are essential to capture customers looking actively for a silver bullet

Stage 3: Frantic building and deployment accompanied with heroic exertions to keep the customers happy

Stage 4: Closing as many deals as possible either for additional financing or for licensing/consulting deals

Stage 5: The early customers start grousing and the momentum slows

Stage 6: Sell off the company or shut down like Delphes, Entopia, Siderean Software and dozens of others.

The problem is not technology, math, or Big Data. The force which undermines these types of outfits is the difficulty of making sense out of words and numbers. In my experience, the task is a very difficult one for humans and for software. Humans want to golf, cruise Facebook, emulate Amazon Echo, or, like water, find the path of least resistance.

Making sense out of information when someone is lobbing mortars at one is a problem which technology can only solve in a haphazard manner. Hope springs eternal and managers are known to buy or license a solution in the hopes that my view of the content processing world is dead wrong.

So far I am on the beam. Content processing requires time, humans, and a range of flawed tools which must be used by a person with old fashioned human thought processes and procedures.

Value is in the eye of the beholder, not in zeros and ones.

Stephen E Arnold, May 19, 2016

Signs of Life from Funnelback

May 19, 2016

Funnelback has been silent as of late, according to our research, but the search company has emerged from the tomb with eyes wide open and a heartbeat.  The Funnelback blog has shared some new updates with us.  The first bit of news, “Searchless In Seattle? (AKA We’ve Just Opened A New Office!)”, explains that Funnelback has opened a new office in Seattle, Washington.  The search company already has offices in Poland, the United Kingdom, and New Zealand, but now it wants to establish a branch in the United States.  Given its successful track record with the finance, higher education, and government sectors in those countries, it stands a chance of offering more competition in the US.  Seattle also has a reputable technology center, and Funnelback will not have to deal with the Silicon Valley group.

The second piece of Funnelback news deals with “Driving Channel Shift With Site Search.”  Channel shift is the process of creating the most efficient and cost effective way to deliver information access and usage to users.  It can be difficult to implement a channel shift, but increasing the effectiveness of a Web site’s search can have a huge impact.

Being able to quickly and effectively locate information on a Web site not only saves time for more important matters, it can also drive sales, further a site’s reputation, and more.

“You can go further still, using your search solution to provide targeted experiences; outputting results on maps, searching by postcode, allowing for short-listing and comparison baskets and even dynamically serving content related to what you know of a visitor, up-weighting content that is most relevant to them based on their browsing history or registered account.

Couple any of the features above with some intelligent search analytics, that highlight the content your users are finding and importantly what they aren’t finding (allowing you to make the relevant connections through promoted results, metadata tweaking or synonyms), and your online experience is starting to become a lot more appealing to users than that queue on hold at your call centre.”
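To make the promoted-results and synonym ideas concrete, here is a minimal sketch of a toy site search that expands queries with a hand-maintained synonym table and pins promoted results to the top. The documents, synonyms, and promotions are invented for illustration; Funnelback’s actual implementation is its own.

# Minimal sketch: synonym expansion plus promoted results in a toy site search.
# Documents, synonyms, and promotions are invented for illustration only.
DOCS = {
    1: "pay your council tax online",
    2: "report a missed bin collection",
    3: "opening hours for the recycling centre",
}
SYNONYMS = {"rubbish": ["bin", "recycling"], "rates": ["council tax"]}
PROMOTED = {"council tax": [1]}   # hand-curated from search analytics

def search(query):
    terms = [query] + SYNONYMS.get(query, [])
    hits = [doc_id for doc_id, text in DOCS.items()
            if any(t in text for t in terms)]
    pinned = PROMOTED.get(query, [])
    # Promoted results first, then the remaining organic hits
    return pinned + [h for h in hits if h not in pinned]

print(search("rubbish"))       # [2, 3] via synonym expansion
print(search("council tax"))   # [1] pinned as a promoted result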

I have written about it many times: a decent Web site search function can make or break a site.  A poor one not only makes the Web site look unprofessional, it also fails to inspire confidence in the business.  It is a very big rookie mistake to make.

 

Whitney Grace, May 19, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

IBM Uses Watson Analytics Freebie Academic Program to Lure in Student Data Scientists

May 6, 2016

The eWeek article titled “IBM Expands Watson Analytics Program, Creates Citizen Data Scientists” zooms in on the expansion of the IBM Watson Analytics academic program, which began last year at 400 global universities. The next phase, according to Watson Analytics public sector manager Randy Messina, is to get Watson Analytics into the hands of students beyond computer science or technical courses. The article explains,

“Other examples of universities using Watson Analytics include the University of Connecticut, which is incorporating Watson Analytics into several of its MBA courses. Northwestern University is building Watson Analytics into the curriculum of its Predictive Analytics, Marketing Mix Models and Entertainment Marketing classes. And at the University of Memphis Fogelman College of Business and Economics, undergraduate students are using Watson Analytics as part of their initial introduction to business analytics.”

Urban planning, marketing, and health care disciplines have also ushered in Watson Analytics for classroom use. Great, so students and professors get to use and learn through this advanced and intuitive platform. But that is where it gets a little shady. IBM is also interested in winning over these students and leading them into the data analytics field. Nothing wrong with that given the shortage of data scientists, but considering the free program and the creepy language IBM uses like “capturing mindshare among young people,” one gets the urge to warn these students to run away from the strange Watson guy, or at least proceed with caution into his lair.

Chelsea Kerwin, May 6, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

Mouse Movements Are the New Fingerprints

May 6, 2016

A martial artist once told me that an individual’s fighting style, if defined enough, was like a set of fingerprints.  The same can be said of painting style, book preferences, and even Netflix selections, but what about something as anonymous as a computer mouse’s movements?  Here is a new scary thought from PC & Tech Authority: “Researcher Can Identify Tor Users By Their Mouse Movements.”

Juan Carlos Norte is a researcher in Barcelona, Spain, and he claims to have developed a series of fingerprinting methods using JavaScript that measure timing, mouse wheel movements, movement speed, CPU benchmarks, and getClientRects.   Combining all of this data allowed Norte to identify Tor users based on how they used a computer mouse.

It seems far-fetched, especially when one considers how random this data is, but

“’Every user moves the mouse in a unique way,’ Norte told Vice’s Motherboard in an online chat. ‘If you can observe those movements in enough pages the user visits outside of Tor, you can create a unique fingerprint for that user,’ he said. Norte recommended users disable JavaScript to avoid being fingerprinted.  Security researcher Lukasz Olejnik told Motherboard he doubted Norte’s findings and said a threat actor would need much more information, such as acceleration, angle of curvature, curvature distance, and other data, to uniquely fingerprint a user.”

This is the age of big data, but looking at Norte’s claim from a logical standpoint, one needs to consider that not all computer mice are made the same: some use lasers, others prefer trackballs, and what about a laptop’s track pad?  As diverse as computer users are, there are similarities within the population, and random mouse movement is not individualistic enough to ID a person.  Fear not, Tor users: move and click away in peace.
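For the curious, here is a minimal sketch of the sort of features a mouse-movement “fingerprint” might aggregate from raw (x, y, timestamp) samples: average speed and a crude turn-angle measure. It is a toy illustration of the idea, not Norte’s JavaScript or Olejnik’s fuller feature set.

# Minimal sketch: crude mouse-movement features from (x, y, t) samples.
# A toy illustration of the fingerprinting idea, not the researchers' code.
import math

def movement_features(samples):
    """samples: list of (x, y, t) tuples; returns average speed and turn angle."""
    speeds, turns = [], []
    for i in range(1, len(samples)):
        (x0, y0, t0), (x1, y1, t1) = samples[i - 1], samples[i]
        dist = math.hypot(x1 - x0, y1 - y0)
        if t1 > t0:
            speeds.append(dist / (t1 - t0))
        if i >= 2:
            xp, yp, _ = samples[i - 2]
            a = math.atan2(y0 - yp, x0 - xp)
            b = math.atan2(y1 - y0, x1 - x0)
            turns.append(abs(b - a))
    return {
        "avg_speed": sum(speeds) / len(speeds) if speeds else 0.0,
        "avg_turn": sum(turns) / len(turns) if turns else 0.0,
    }

print(movement_features([(0, 0, 0.00), (5, 1, 0.05), (9, 6, 0.12), (10, 12, 0.20)]))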

 

Whitney Grace, May 6, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Mastering SEO Is Mastering the Internet

May 5, 2016

Search engine optimization, better known as SEO, is one of the prime tools Web site owners must master in order for their site to appear in search results.   A common predicament most site owners find themselves in is that they may have a fantastic page, but if a search engine has not crawled it, the site might as well not exist.  There are many aspects to mastering SEO and it can be daunting to attempt to make a site SEO friendly.  While there are many guides that explain SEO, we recommend Mattias Geniar’s “A Technical Guide To SEO.”

Some SEO guides get too deep into technical jargon, but Geniar’s approach uses plain language, so it will be helpful even if you have the most novice SEO skills.  Here is how Geniar explains it:

“If you’re the owner or maintainer of a website, you know SEO matters. A lot. This guide is meant to be an accurate list of all technical aspects of search engine optimisation.  There’s a lot more to being “SEO friendly” than just the technical part. Content is, as always, still king. It doesn’t matter how technically OK your site is, if the content isn’t up to snuff, it won’t do you much good.”

Understanding the code behind SEO can be challenging, but thank goodness content remains the most important part of being picked up by Web crawlers.  These tricks will only augment your content so it is picked up more quickly and you receive more hits on your site.
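As a small, hedged illustration of the “technical part,” here is a sketch that checks a page’s HTML for a few of the basics such guides cover: a title, a meta description, a canonical link, and a robots meta tag. It uses only the Python standard library and is a starting point, not a substitute for Geniar’s checklist.

# Minimal sketch: check an HTML page for a few basic technical SEO signals.
# A starting point only; consult a full guide for the complete checklist.
from html.parser import HTMLParser

class SEOCheck(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = {"title": False, "description": False,
                      "canonical": False, "robots": False}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.found["title"] = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.found["description"] = True
        elif tag == "meta" and attrs.get("name") == "robots":
            self.found["robots"] = True
        elif tag == "link" and attrs.get("rel") == "canonical":
            self.found["canonical"] = True

page = """<html><head><title>Widgets</title>
<meta name="description" content="All about widgets.">
<link rel="canonical" href="https://example.com/widgets"></head></html>"""

checker = SEOCheck()
checker.feed(page)
print(checker.found)   # the robots meta tag is missing in this example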

 

Whitney Grace, May 5, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Out of the Shadows and into the OpenBazaar

May 2, 2016

If you believe the Dark Web was destroyed when Silk Road went offline, think again!  The Dark Web has roots like a surface weed: when one root remains, there are dozens (or in this case millions) more to keep the weed growing.  Tech Insider reports that OpenBazaar now occupies the space Silk Road vacated in “A Lawless And Shadowy New Corner Of The Internet Is About To Go Online.”

OpenBazaar is described as a decentralized and uncensored online marketplace where people can sell anything without the fuzz breathing down their necks. Brian Hoffman and his crew have worked on it since 2014, when Amir Taaki thought it up.  It works similarly to eBay and Etsy as a peer-to-peer market, but instead of hard currency it uses bitcoin.  Since it is decentralized, it will be nearly impossible to take offline, unlike Silk Road.  Hoffman took over the project from Taaki, and after $1 million from tech venture capital firms, the testnet is live.

“There’s now a functioning version of OpenBazaar running on the “testnet.” This is a kind of open beta that anyone can download and run, but it uses “testnet bitcoin” — a “fake” version of the digital currency for running tests that doesn’t have any real value. It means the developer team can test out the software with a larger audience and iron out the bugs without any real risk. “If people lose their money it’s just a horrible idea,” Hoffman told Business Insider.”

A new user signs up for the OpenBazaar testnet every two minutes, and Hoffman hopes to find all the bugs before the public launch.  Hoffman once wanted to run the next-generation digital black market, but now he is advertising it as a new Etsy.  The lack of central authority means lower take rates, that is, the fees sellers incur for selling on the site.  Hoffman says it will be good competition for online marketplaces because it will force peer-to-peer services like eBay and Etsy to find new ways to add value-added services instead of raising fees on customers.

 

Whitney Grace, May 2, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

An Open Source Search Engine to Experiment With

May 1, 2016

Apache Lucene receives the most headlines when it comes to discussion about open source search software.  My RSS feed pulled up another open source search engine that shows promise in being a decent piece of software.  Open Semantic Search is free software that can be used for text mining, analytics, a search engine, a data explorer, and other research tools.  It is based on Elasticsearch/Apache Solr’s open source enterprise search.  It was designed with open standards and with robust semantic search.

As with any open source search engine, it can be configured with numerous features based on the user’s preferences.  These include tagging, annotation, support for varying file formats, support for multiple data sources, data visualization, newsfeeds, automatic text recognition, faceted search, interactive filters, and more.  It has the added benefit that it can be set up for mobile platforms, metadata management, and file system monitoring.

Open Semantic Search is described as

“Research tools for easier searching, analytics, data enrichment & text mining of heterogeneous and large document sets with free software on your own computer or server.”

While its base code is derived from Apache Lucene, it takes the original product and builds something better.  Proprietary software is an expense dubbed a necessary evil if you work in a large company.  If, however, you are a programmer and have the time to develop your own search engine and analytics software, do it.  It could even turn out better than the proprietary stuff.
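Because the stack sits on Solr, experimenting is mostly a matter of issuing standard Solr queries. Here is a minimal sketch of a faceted search request; the host, core name, and facet field are placeholders and will differ in an actual Open Semantic Search install.

# Minimal sketch: a faceted query against a Solr-based search backend.
# The host, core name, and facet field are placeholders for illustration.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

params = urlencode({
    "q": "text mining",
    "rows": 10,
    "facet": "true",
    "facet.field": "content_type",   # placeholder field name
    "wt": "json",
})
url = "http://localhost:8983/solr/opensemanticsearch/select?" + params

with urlopen(url) as resp:
    data = json.load(resp)

print(data["response"]["numFound"])
print(data.get("facet_counts", {}).get("facet_fields", {}))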

 

Whitney Grace, May 1, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

A Dark Web Spider for Proactive Protection

April 29, 2016

There is a new tool for organizations to more quickly detect whether their sensitive data has been hacked.  The Atlantic discusses “The Spider that Crawls the Dark Web Looking for Stolen Data.” Until now, it was often many moons before an organization realized it had been hacked. Matchlight, from Terbium Labs, offers a more proactive approach. The service combs the corners of the Dark Web looking for the “fingerprints” of its clients’ information. Writer Kaveh Waddell reveals how it is done:

“Once Matchlight has an index of what’s being traded on the Internet, it needs to compare it against its clients’ data. But instead of keeping a database of sensitive and private client information to compare against, Terbium uses cryptographic hashes to find stolen data.

“Hashes are functions that create an effectively unique fingerprint based on a file or a message. They’re particularly useful here because they only work in one direction: You can’t figure out what the original input was just by looking at a fingerprint. So clients can use hashing to create fingerprints of their sensitive data, and send them on to Terbium; Terbium then uses the same hash function on the data its web crawler comes across. If anything matches, the red flag goes up. Rogers says the program can find matches in a matter of minutes after a dataset is posted.”
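The mechanism described is easy to sketch. Below is a minimal, hedged illustration of the general hash-and-compare idea using SHA-256 from Python’s standard library; it is not Terbium’s actual fingerprinting scheme, which the article does not detail.

# Minimal sketch of the general hash-and-compare idea the article describes.
# Not Terbium's actual scheme; just SHA-256 fingerprints on both sides.
import hashlib

def fingerprint(record: str) -> str:
    return hashlib.sha256(record.encode("utf-8")).hexdigest()

# Client side: share only fingerprints of sensitive records, never the data.
client_fingerprints = {fingerprint(r) for r in ["4111-1111-1111-1111", "jdoe@example.com"]}

# Crawler side: hash whatever turns up on a dark web listing the same way.
crawled_listing = ["4111-1111-1111-1111", "555-01-2345"]
matches = [item for item in crawled_listing if fingerprint(item) in client_fingerprints]

print(matches)   # a non-empty list is what raises the red flag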

What an organization does with this information is, of course, up to them; but whatever the response, now they can implement it much sooner than if they had not used Matchlight. Terbium CEO Danny Rogers reports that, each day, his company sends out several thousand alerts to their clients. Founded in 2013, Terbium Labs is based in Baltimore, Maryland. As of this writing, they are looking to hire a software engineer and an analyst, in case anyone here is interested.

 

Cynthia Murrell, April 29, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
