Alphabet Google Search: Dominance Ending?

February 28, 2016

I read “Will PageRank Expiration Threaten Google’s Dominance.” The main point is a question: Will Google’s shift to artificial intelligence “hurt Google Search’s market share and its stock price?”

The write up references the 1997 paper about the search engine’s core algorithms. (There is no reference to the work by Jon Kleinberg and the Clever system, which is understandable I suppose.) Few want to view Google as a me-too outfit, “cleverly” overlooking the firm’s emulation strategy. Think GoTo.com/Overture/Yahoo for the monetization mechanism.

The write up states:

The Google Search today transcends PageRank: Google has a myriad of proprietary technology.

I agree. Google is not an open source centric outfit. When was the last time Google made it easy to locate its employees’ technical papers, presentations at technical conferences, or details about products and services which just disappear? Orkut, anyone?

The write up shifts its focus to some governance issues; for example, Google’s Loon balloon, solving death, etc. There is a reference to Google’s strategy concerning mobile phones.

Stakeholders may want to worry because Google is dependent on search for the bulk of its revenues. I learned:

From Alphabet’s recent 10-k and Google’s Search revenues from Statista, you will realize that Search has been ~92%, ~90%, ~90% of total revenues in 2013-2015 respectively.

No big news here.

The core challenge for analysts will be to figure out if a shift to artificial intelligence methods for search will have unforeseen consequences. For example, maybe Google has figured out that indexing the Web costs too much. AI may be a way to reduce the costs of indexing and serving results. Google may realize that the shift from desktop queries to mobile queries threatens its ability to deliver information with the relevance users came to expect from the desktop experience.

Alphabet Google is at a crossroads. The desktop model from the late 1990s is less and less relevant in 2016. Like executives at any other company faced with change, Google’s leaders find themselves in the same boat as other online vendors. Today’s revenues may not be the revenues of tomorrow.

Will Alphabet Google face the information headwinds which buffeted Autonomy, Convera, Endeca, Fast Search & Transfer, and similar information access vendors? Is Google facing a long walk down the path which America Online and Yahoo followed? Will the one trick revenue pony die when it cannot adapt to the mobile jungle?

Good questions. Answers? Tough to find.

Stephen E Arnold, February 28, 2016

DuckDuckGo: Challenging Google Is Not a Bad Idea

February 25, 2016

I read “The Founder of DuckDuckGo Explains Why Challenging Google Isn’t Insane.” I noted several statements in the write up; namely:

  • DuckDuckGo delivers three billion searches a year, compared to Google’s trillion-plus searches per year. The zeros can be confusing to an addled goose like me. Let me say that Google is delivering more search results than DuckDuckGo.com.
  • DuckDuckGo’s revenues in 2015 were more than $1 million. Google’s revenues were about $75 billion. Yep, more zeros. (A back-of-envelope comparison appears after this list.)
  • It used to take Google six months to index pages on the Internet. (I thought that Google indexed from its early days based on a priority algorithm. Some sites were indexed in a snappy manner; others, like the Railroad Retirement Board, less snappily. I am probably dead wrong here, but it is a nifty point to underscore Google’s slow indexing. I just don’t think it was or is true.)
  • DuckDuckGo was launched in 2008. The company is almost eight years old.
  • Google’s incognito mode is a myth. What about those Google cookies? (I think the incognito mode nukes those long lived goodies.)
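The zeros are easier to keep straight with a bit of arithmetic. The snippet below just replays the figures cited in the write up; nothing here comes from DuckDuckGo or Google directly.

```python
# Back-of-envelope comparison using the figures quoted in the write up.
ddg_searches = 3e9      # DuckDuckGo: about three billion searches a year
goog_searches = 1e12    # Google: a trillion-plus searches a year (low end)
ddg_revenue = 1e6       # DuckDuckGo 2015 revenue: more than $1 million
goog_revenue = 75e9     # Google revenue: about $75 billion

print(f"Search volume: roughly {goog_searches / ddg_searches:,.0f} to 1")
print(f"Revenue: roughly {goog_revenue / ddg_revenue:,.0f} to 1")
# Prints roughly 333 to 1 on searches and 75,000 to 1 on revenue.
```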

Here’s the passage I highlighted:

Adams (the interviewer): I thought the government could track me whether I use DuckDuckGo or not.

Weinberg (the founder of DuckDuckGo): No they can’t. They can get to your Google searches, but if you use DuckDuckGo it’s completely encrypted between you and us. We don’t store anything. So there’s no data to get. The government can’t subpoena us for records because we don’t have records.

DuckDuckGo beats the privacy drum. That’s okay, but the privacy of Tor and I2P can be called into question. Is it possible that there are systems and methods to track user queries with or without the assistance of the search engine system? My hunch is that there are some interesting avenues to explore from companies providing tools to various government agencies. What about RACs, malware, metadata analyses, etc.? Probably I am wrong again. RATs. I have no immunity from my flawed information. I may have to grab my swim fins and go fin-fishing. I could also join a hacking team and vupen it up.

Stephen E Arnold, February 25, 2016

Data Insight: Common Sense Makes Sense

February 25, 2016

I am skeptical about lists of problems which hot buzzwords leave in their wake. I read “Why Data Insight Remains Elusive,” which I thought was another content marketing pitch to buy, buy, buy. Not so. The write up contains some clearly expressed, common sense reminders for those who want to crunch big data and point and click their way through canned reports. Those who actually took the second semester of Statistics 101 know that ignoring data quality and the nitty gritty of the textbook procedures can lead to bonehead outputs.

The write up identifies some points to keep in mind, regardless of which analytics vendor system a person is using to make more informed or “augmented” decisions.

Here’s the pick of the litter:

  1. Manage the data. Yep, time consuming, annoying, and essential. Skip this step at your decision making peril.
  2. Manage the indexing. The buzzword is metadata, but assigning keywords and other indexing items makes the difference when trying to figure out who, what, why, when, and where. Time consuming? Yep, and metadata is something not even the Alphabet Google thing does particularly well.
  3. Create data models. Do the textbook stuff. Get the model wrong, and what happens? Failure on a scale equivalent to fumbling the data management processes.
  4. Visualization is not analytics. Visualization makes outputs of numerical recipes appear in graphical form. Do not confuse Hollywood outputs with relevance, accuracy, or math on point to the problem one is trying to resolve.
  5. Knee jerking one’s way through analytics. Sorry, reflexes are okay but useless without context. Yep: have a problem, get the data, get the model, test, and examine the outputs. (A minimal sketch of this workflow appears after the list.)
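For those who like to see the floorboards before the retrofit, here is a minimal sketch of the workflow the list implies. It is my own illustration, not anything from the write up; the input file and column names are made up.

```python
# Minimal sketch of the list above: manage the data, build a textbook model,
# and examine the outputs instead of admiring a dashboard. Names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("events.csv")  # hypothetical input file

# 1. Manage the data: drop duplicates and rows missing the fields we model on.
df = df.drop_duplicates().dropna(subset=["age", "visits", "converted"])

# 3. Create a data model the textbook way, with a holdout set.
X_train, X_test, y_train, y_test = train_test_split(
    df[["age", "visits"]], df["converted"], test_size=0.2, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

# 5. Examine the outputs rather than knee jerking from a pretty chart.
print("Holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```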

Common sense. Most of this basic stuff was in the textbooks for one’s college courses. Too bad more folks did not internalize those floorboards; now they seek contractors to do a retrofit. Quite an insight when the bill arrives.

Stephen E Arnold, February 25, 2016

Social Media Search: Will Informed People Respond?

January 19, 2016

I recall asking for directions recently. There were three young people standing outside a bookstore. I wanted to know where the closest ice cream shop was. The three looked at me, smiled, looked at one another, smiled, and one of them said: “No clue.”

I like the idea of asking a group of people for information, but the experiences I have suggest that one has to be careful. Ask a tough question and no one may know the answer. Ask a question in an unfamiliar way such as “shop” instead of Dairy Queen, and the group may not have the faintest idea what one is talking about.

These thoughts influenced my reading of “Social Media: The Next Best Search Engine.” The title seemed to suggest that I could rely on my old school tricks but I would be silly not to use Facebook and Twitter to get information. That’s okay, but I don’t use Facebook, and the Twitter tweet thing seems to be down.

Bummer.

The write up reports:

Many consumers skip right over Google or Yahoo when conducting a search, and instead type it into social media networks.

The approach may work for peak TV and Miley Cyrus news, but I find analysis of social media intercept data more helpful for some of my queries.

Here’s the trick, according to the article:

To make sure you are responding to this growing trend, be present on social media on the channels that best make sense for your company. …The best way to optimize your posts is through hashtags and the content itself. For Facebook, Twitter, Google+ and Instagram, be sure to include relevant hashtags in your posts so that users can find your posts. For sites such as LinkedIn and Yelp which don’t utilize hashtags, make sure that you fill out your profiles as completely as possible.

Okay, indexing and details.

Search? I don’t think I will change my methods.

Stephen E Arnold, January 19, 2016

Search Is Marketing and Lots of Other Stuff Like Semantics

January 12, 2016

I spoke with a person who asked me, “Have you seen the 2013 Dave Amerland video?” The video in question is “Google Semantic Search and its Impact on Business.”

I hadn’t. I watched the five-minute video and formed some impressions / opinions about the information presented. Now I wish I had not invested five minutes in serial content processing.

First, the premise that search is marketing does not match up with my view of search. In short, search is more than marketing, although some view search as essential to making a sale.

Second, the video generates buzzwords. There’s knowledge graph, semantic, reputation, Big Data, and more. If one accepts the premise that search is about sales, I am not sure what these buzzwords contribute. The message is that when a user looks for something, the system should display a message that causes a sale. Objectivity does not have much to do with this, nor do buzzwords.

Third, the presentation of the information was difficult for me to follow. My attention was undermined by the wild and wonderful assertions about the buzzwords. I struggled with “from strings to things, from Web sites to people.” What?

The video is ostensibly about the use of “semantics” in content. I am okay with semantic processes. I understand that keeping words and metaphors consistent is helpful to a human and to a Web indexing system.

But the premise. I have a tough time buying in. I want search to return high value, on point content. I want those who create content to include helpful information, details about sources, and markers that make it possible for a reader to figure out what’s sort of accurate and what’s opinion.

I fear that the semantics practiced in this video shriek, “Hire me.” I also note that the video is a commercial for a book which presumably amplifies the viewpoint expressed in the video. That means the video vocalizes, “Buy my book.”

Heck, I am happy if I can get an on point result set when I run a query. No shrieking. No vocalization. No buzzwords. Will objective search be possible?

Stephen E Arnold, January 12, 2016

Dark Web: How Big Is It?

January 11, 2016

I read “Big Data and the Deep, Dark Web.” The write up raises an important point. I question the data, however.

First, there is the unpleasant task of dealing with terminology. A number of different phrases appear in the write up; for example:

  • Dark Web
  • Deep Web
  • Surface Web
  • World Wide Web

Getting hard data about the “number” of Web pages or Web sites is an interesting problem. I know that popular content gets indexed frequently. That makes sense in an ad-driven business model. I know that less frequently indexed content often is an unhappy consequence of resource availability. It takes time and money to index every possible link on each index cycle. I know that network latency can cause an indexing system to move on to another, more responsive site. Then there is bad code and intentional obfuscation, such as my posting content on Xenky.com for those who attend my LEA/Intelligence lectures sponsored by Telestrategies in information friendly Virginia.
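The resource and latency trade-offs are easy to sketch. The toy snippet below is illustrative only; no search vendor has shown me its crawl scheduler, and the priority scores and timeout are assumptions on my part.

```python
# Toy priority crawl queue with a latency budget. Popular pages get fetched
# first; slow or unreachable hosts are skipped until the next cycle.
import heapq
import urllib.request
from urllib.error import URLError

# (priority, url): lower numbers are crawled first; the scores here are made up.
frontier = [(0.1, "https://example.com/popular"),
            (0.9, "https://example.com/obscure")]
heapq.heapify(frontier)

TIMEOUT_SECONDS = 5  # assumed latency budget per fetch

while frontier:
    priority, url = heapq.heappop(frontier)
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as response:
            print(f"Indexed {url} ({len(response.read())} bytes)")
    except URLError:
        # Too slow, broken, or blocked: move on rather than burn the crawl budget.
        print(f"Skipped {url}; try again next cycle")
```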

Then there is the question of the Surface Web, which I call the Clear Web. It allows access to a Wall Street Journal article when I click a link from one site but not from another. The Wall Street Journal requires a user name and password, sometimes. So what is this? A Clear Web site, or a visible but not accessible site?

The terminology is messy.

Bright Planet coined the Deep Web moniker more than a decade ago. The usage was precise: These are sites which are not static; for example, dynamically generated Web pages. An example would be the Southwest Airlines fare page. A user has to click in order to get the pricing options. Bright Planet also included password protected sites. Examples range from a company’s Web page for employees to sites which require the user to pay money to gain access.

Then we have the semi exciting Dark Web, which can also be referenced as the Hidden Web.

Most folks writing about the number of Web sites or Web pages available in one of these collections are pretty much making up data.

Here’s an example of fanciful numerics. Note the disclaimers, which are a flashing yellow caution light for me:

Accurately determining the size of the deep web or the dark web is all but impossible. In 2001, it was estimated that the deep web contained 7,500 terabytes of information. The surface web, by comparison, contained only 19 terabytes of content at the time. What we do know is that the deep web has between 400 and 550 times more public information than the surface web. More than 200,000 deep web sites currently exist. Together, the 60 largest deep web sites contain around 750 terabytes of data, surpassing the size of the entire surface web by 40 times. Compared with the few billion individual documents on the surface web, 550 billion individual documents can be found on the deep web. A total of 95 percent of the deep web is publically accessible, meaning no fees or subscriptions.
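A quick pass with a calculator shows why the caution light flashes. The check below simply replays the quoted figures; none of the numbers are mine.

```python
# Replaying the 2001-era estimates quoted above to see how they hang together.
deep_web_tb = 7500     # "the deep web contained 7,500 terabytes"
surface_web_tb = 19    # "the surface web ... contained only 19 terabytes"
top_60_deep_tb = 750   # "the 60 largest deep web sites contain around 750 terabytes"

print(deep_web_tb / surface_web_tb)     # about 395x, versus the claimed "400 and 550 times"
print(top_60_deep_tb / surface_web_tb)  # about 39x, the quote's "40 times", from just 60 sites
```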

Where do these numbers come from? How many sites require Tor to access their data? I am working on my January Webinar for Telestrategies. Sorry. Attendance is limited to those active in LEA/Intelligence/Security. I queried one of the firms actively monitoring and indexing Dark Web content. That company, which you may want to pay attention to, is Terbium Labs. Visit them at www.terbiumlabs.com. Like most of the outfits involved in Dark Web analytics, certain information is not available. I was able to get some ball park figures from one of the founders. (He is pretty good with counting since he is a sci-tech type with industrial strength credentials in the math oriented world of advanced physics.)

Here’s the information I obtained which comes from Terbium Labs’s real time monitoring of the Dark Web:

We [Terbium Labs] probably have the most complete picture of it [the Dark Web] compared to most anyone out there.  While we don’t comment publicly on our specific coverage, in our estimation, the Dark Web, as we loosely define it, consists of a few tens of thousands or hundreds of thousands of domains, including light web paste sites and carding forums, Tor hidden services, i2p sites, and others.  While the Dark Web is large enough that it is impossible to comprehensively consume by human analysts, compared with the billion or so light web domains, it is relatively compact.

My take is that the Dark Web is easy to talk about. It is more difficult to obtain informed analysis of the Dark Web: what is available, which sites are operated by law enforcement and government agencies, and which sites are engaged actively in Dark Web commerce, information exchange, publishing, and other tasks.

One final point: The Dark Web uses Web protocols. In a sense, the Dark Web is little more than a suburb of the metropolis that Google indexes selectively. For more information about the Dark Web and its realities, check out my forthcoming Dark Web Notebook. If you want to reserve a copy, email benkent2020 at yahoo dot com. LEA, intel, and security professionals get a discount. Others pay $200 per copy.

Stephen E Arnold, January 11, 2016

Google and Students: The Quest for Revenue

January 7, 2016

The Alphabet Google thing is getting more focused in its quest for revenue in the post desktop search world. I read “Google Is Tracking Students As It Sells More Products to Schools, Privacy Advocates Warn.” I remember the good old days when the Google was visiting universities to chat about its indexing of the institutions’ Web sites and the presentations related to the book scanning project. This write up seems, if Jeff Bezos’ newspaper is spot on, to suggest that the Alphabet Google thing is getting more interested in students, not just the institutions.

I read:

More than half of K-12 laptops or tablets purchased by U.S. schools in the third quarter were Chromebooks, cheap laptops that run Google software…. But Google is also tracking what those students are doing on its services and using some of that information to sell targeted ads, according to a complaint filed with federal officials by a leading privacy advocacy group.

The write up points out:

In just a few short years, Google has become a dominant force as a provider of education technology…. Google’s fast rise has partly been because of low costs: Chromebooks can often be bought in the $100 to $200 range, a fraction of the price for a MacBook. And its software is free to schools.

Low prices. Well, Amazon is into that type of marketing too, right? Collecting data. Isn’t Amazon gathering data for its recommendations service?

My reaction to the write up is that the newspaper will have more revelations about the Alphabet Google thing. The security and privacy issue is one that has the potential to create some excitement in the land of online giants.

Stephen E Arnold, January 7, 2016

Did Apple Buy Topsy for an Edge over Google?

January 7, 2016

A couple years ago, Apple bought Topsy Labs, a social analytics firm and Twitter partner out of San Francisco. Now, in “Apple Inc. Acquired Topsy to Beat Google Search Capabilities,” BidnessEtc reports on revelations from Topsy’s former director of business development, Aaron Hayes-Roth. Writer Martin Blanc reveals:

“The startup’s tools were considered to be fast and reliable by the customers who used them. The in-depth analysis was smart enough to go back to 2006 and provide users with analytics and data for future forecasts. Mr. Roth and his team always had a curiosity attached to how Apple would use Twitter in its ecosystem. Apple does not make use of Twitter that much; the account was made in 2011 and there aren’t many tweets that come out of the social network. However, Mr. Roth explains that it was not Twitter data that Apple had its eye on; it was the technology that powered it. The architecture of Topsy makes it easier for systems to search large amounts of data extremely fast with impressive indexing capabilities. Subsequently, Apple’s ecosystem has developed quite a lot since Siri was first introduced with the iPhone 4s. The digital assistant and the Spotlight search are testament to how far Apple’s search capabilities have come.”

The article goes on to illustrate some of those advances, then points out the ongoing rivalry between Apple and Google. Are these improvements the result of Topsy’s tech? And will they give Apple the edge they need over their adversary? Stay tuned.


Cynthia Murrell, January 7, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

IBM Generates Text Mining Work Flow Diagram

January 4, 2016

I read “Deriving Insight Text Mining and Machine Learning.” This is an article with a specific IBM Web address. The diagram is interesting because it does not explain which steps are automated, which require humans, and which are one of those expensive man-machine processes. When I read about any text related function available from IBM, I think about Watson. You know, IBM’s smart software.

Here’s the diagram:

[Diagram: IBM’s text mining and machine learning work flow]

If you find this hard to read, you are not in step with modern design elements. Millennials, I presume, love these faded colors.

Here’s the passage I noted about the important step of “attribute selection.” I interpret attribute selection to mean indexing, entity extraction, and related operations. Because neither human subject matter specialists nor smart software performs this function particularly well, I highlighted the passage in red ink in recognition of IBM’s 14 consecutive quarters of financial underperformance:

Machine learning is closely related to and often overlaps with computational statistics—a discipline that also specializes in prediction-making. It has strong ties to mathematical optimization, which delivers methods, theory and application domains to the field. It is employed in a range of computing tasks where designing and programming explicit algorithms is infeasible. Example applications include spam filtering, optical character recognition (OCR), search engines and computer vision. Text mining takes advantage of machine learning specifically in determining features, reducing dimensionality and removing irrelevant attributes. For example, text mining uses machine learning on sentiment analysis, which is widely applied to reviews and social media for a variety of applications ranging from marketing to customer service. It aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. The attitude may be his or her judgment or evaluation, affective state or the intended emotional communication. Machine learning algorithms in text mining include decision tree learning, association rule learning, artificial neural learning, inductive logic programming, support vector machines, Bayesian networks, genetic algorithms and sparse dictionary learning.
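To be clear, the snippet below is not IBM’s code. It is a generic sketch of the “attribute selection” and sentiment classification steps the passage describes, using scikit-learn and a made-up four document training set.

```python
# Generic illustration of attribute (feature) selection plus a sentiment
# classifier. Not IBM's implementation; the tiny training set is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["great support, fast shipping", "terrible service, never again",
        "love this product", "broken on arrival, very disappointed"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

pipeline = make_pipeline(
    TfidfVectorizer(),        # turn text into weighted term features
    SelectKBest(chi2, k=5),   # attribute selection: keep the most informative terms
    MultinomialNB(),          # a simple Bayesian learner from the algorithm list
)
pipeline.fit(docs, labels)
print(pipeline.predict(["slow shipping and poor service"]))  # 0 or 1 for the new text
```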

Interesting, but how does this IBM stuff actually work? Who uses it? What’s the payoff from these use cases?

More questions than answers to explain the hard to read diagram, which looks quite a bit like a 1998 Autonomy graphic. I recall being able to read the Autonomy image, however.

Stephen E Arnold, December 30, 2015

Weekly Watson: In the Real World

January 2, 2016

I want to start off the New Year with a look at Watson in the real world. My real world is circumscribed by abandoned coal mines and hollows in rural Kentucky. I am pretty sure this real world is not the real world assumed in “IBM Watson: AI for the Real World.” IBM has tapped Bob Dylan, a TV game show, and odd duck quasi chemical symbols to communicate the importance of search and content processing.

The write up takes a different approach. In fact, the article begins with an interesting comment:

Computers are stupid.

There you go. A snazzy one liner.

The reminder that a man made device is not quite the same as one’s faithful boxer dog or the next door neighbor’s teen is startling.

The article summarizes an interview with a Watson wizard, Steven Abrams, director of technology for the Watson Ecosystem. This is one of those PR inspired outputs which I quite enjoy.

The write up quotes Abrams as saying:

“You debug Watson’s system by asking, ‘Did we give it the right data?'” Abrams said. “Is the data and experience complete enough?”

Okay, but isn’t this Dr. Mike Lynch’s approach? Lynch, as you may recall, was the Cambridge University wizard who was among the first to commercialize “learning” systems in the 1990s.

According to the write up:

Developers will have data sets they can “feed” Watson through one of over 30 APIs. Some of them are based on XML or JSON. Developers familiar with those formats will know how to interact with Watson, he [Abrams] explained.

As those who have used the 25 year old Autonomy IDOL system know, preparing the training data takes a bit of effort. Then as current content is fed into the Autonomy IDOL system, the humans have to keep an eye on the indexing. Ignore the system too long, and the indexing “drifts”; that is, the learned content is not in tune with the current content processed by the system. Sure, algorithms attempt to keep the calibrations precise, but there is that annoying and inevitable “drift.”
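The drift problem is easy to illustrate. Here is a toy check of my own devising, not anything from IBM or Autonomy: it compares the vocabulary of the training corpus with the vocabulary of newly processed content and flags divergence.

```python
# Toy drift check: compare term frequencies in the training corpus against
# newly indexed content. The threshold is arbitrary; this is not vendor code.
import math
from collections import Counter

def term_distribution(docs):
    counts = Counter(word for doc in docs for word in doc.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def divergence(p, q, epsilon=1e-9):
    """Crude KL-style divergence of the new distribution q from the training distribution p."""
    terms = set(p) | set(q)
    return sum(q.get(t, epsilon) * math.log(q.get(t, epsilon) / p.get(t, epsilon))
               for t in terms)

training_docs = ["enterprise search relevance", "index the enterprise content"]
new_docs = ["mobile app push notifications", "mobile push alerts"]

score = divergence(term_distribution(training_docs), term_distribution(new_docs))
print(f"Drift score: {score:.2f}")  # a rising score suggests recalibration is overdue
if score > 1.0:                     # arbitrary threshold for this toy example
    print("Vocabulary drift detected: time to retrain or re-tune.")
```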

IBM’s system, which strikes me as a modification of the Autonomy IDOL approach with a touch of Palantir analytics stirred in, is likely to be one expensive puppy to groom for the dog show ring.

The article profiles the efforts of a couple of IBM “partners” to make Watson useful for the “real” world. But the snip I circled in IBM red-ink red was this one:

But Watson should not be mistaken for HAL. “Watson will not initiate conduct on its own,” IBM’s Abrams pointed out. “Watson does not have ambition. It has no objective to respond outside a query.” “With no individual initiative, it has no way of going out of control,” he continued. “Watson has a plug,” he quipped. It can be disconnected. “Watson is not going to be applied without individual judgment … The final decision in any Watson solution … will always be [made by] a human, being based on information they got from Watson.”

My hunch is that Watson will require considerable human attention. But it may perform best on a TV show or in a motion picture where post production can smooth out the rough edges.

Maybe entertainment is “real”, not the world of a Harrod’s Creek hollow.

Stephen E Arnold, January 2, 2016

