Hitachi Digs into Enterprise Search

November 23, 2016

Hitachi Data Systems has embraced “content intelligence.” My recollection is that the “search” underlying the Hitachi Content Platform is Perfect Search, a proprietary system which emphasized its performance features, not its ease of use for system administrators.

“Hitachi Adds Enterprise Search to Object Store” informs me that:

Hitachi Data Systems today debuted Content Intelligence, a new offering that adds a slew of enterprise search and analytic capabilities to its object-based file system.

Slew?

The system supports multi-tenant, cloud-scale deployments. The block diagram for the system looks like this:

[Block diagram of the Hitachi Content Intelligence system]

According to a Hitachi professional, the new system will be “invaluable.” That is, I presume, a “slew” of value.

Hitachi was the second-best system for object storage according to the big moon, mid-tier consulting firm Gartner Group. The number one system was IBM Watson’s cell mate CleverSafe dsNet. (This is not the IBM Almaden Clever system for relevance determination.)

Other features, in addition to search, are a cloud gateway component, a file synchronization tool, and the ability to share access. For more information about the system, you can read “Better Object Storage with Hitachi Content Platform 2014.”

Stephen E Arnold, November 23, 2016

Hear That Bing Ding: A Warning for Google Web Search

November 23, 2016

Bing. Bing. Bing. The sound reminds me of a broken elevator door in the Block & Kuhl when I was but a wee lad. Bing. Bing. Bing. Annoying? You bet.

I read “Microsoft Corporation Can Defeat Alphabet Inc in Search.” I enjoy these odd, disconnected-from-the-real-world write ups predicting that Microsoft will trounce Google in a particular niche. This particular write up seizes upon the fluff about Microsoft having an “intelligence fabric.” Then, with a spectacular leap that ignores the fact that more than 90 percent of humans use Google Web search, it suggests that Bing will be the next big thing in Web search.

Get real.

Bing, after two decades of floundering, is allegedly profitable. No word on how long it will take to pay back the money Microsoft has invested in Web search over these 4,000 days of stumbling.

I highlighted this passage in the write up:

Rik van der Kooi, corporate vice president of Microsoft Search Advertising, referred to Bing as an “intelligence fabric” that has been embedded into Windows 10, Cortana, Xbox and other products, including Hololens. He went on to say the future Bing will be personal, pervasive and offer a personal experience so much that it “might not be obvious users are even interacting with the search engine.”

I think I understand. Microsoft is everywhere. Microsoft Bing is embedded. Therefore, Microsoft beats Google Web search.

Great thinking.

I do like this passage:

This is a bold call considering that Google owned 89.38% of the global desktop search engine market, while Microsoft owned 4.2% as of July 2016, according to data provided by Statista. With MSFT’s endeavors to create an integrated ecosystem, however, the long-term scale is tipping in the favor of Microsoft stock. That’s because Microsoft’s traditional business is entrenched into many people’s lives as well as business operations. For instance, the majority of desktop devices run on Windows.

Yep, there are lots of desktops still. However, there are more mobile devices. If I am not mistaken, Google’s Android runs more than 80 percent of these devices. Add desktop and mobile and what do you get? No dominance of Web search by Bing the way I understand the situation.

Sure, I love the Bing thing. I have some affection for Qwant.com, Yandex.com, and Inxight.com too. But Microsoft has yet to demonstrate that it can deliver a Web search system which is able to change the behaviors of today’s users. Look at Google in the word processing space. Microsoft continues to have an edge, and Google has been trying for more than a decade to make Word an afterthought. That hasn’t happened. Inertia is a big factor.

Search for growing market share on Bing. What’s that answer look like? Less than five percent of the Web search market? Oh, do that query on Google by the way.

Stephen E Arnold, November 23, 2016

Writing That Is Never Read

November 23, 2016

It is inevitable that in college you were forced to write an essay. Writing an essay usually requires the citation of various sources from scholarly journals. As you perused the academic articles, the thought probably crossed your mind: who ever reads this stuff? Smithsonian Magazine tells us who in the article, “Academics Write Papers Arguing Over How Many People Read (And Cite) Their Papers.” In other words, the academics themselves.

Academic articles are read mostly by their authors, journal editors, and students forced to cite them for assignments, the study’s authors write. In perfect scholarly fashion, many academics do not believe that their work has a limited scope. So what do they do? They decide to write about it, and they have done so for twenty years.

Most academics are not surprised that most written works go unread. The common belief is that it is better to publish something rather than nothing, and publishing may also be a requirement for keeping one’s position. As they are prone to do, academics complain about the numbers and their accuracy:

It seems like this should be an easy question to answer: all you have to do is count the number of citations each paper has. But it’s harder than you might think. There are entire papers themselves dedicated to figuring out how to do this efficiently and accurately. The point of the 2007 paper wasn’t to assert that 50 percent of studies are unread. It was actually about citation analysis and the ways that the internet is letting academics see more accurately who is reading and citing their papers. “Since the turn of the century, dozens of databases such as Scopus and Google Scholar have appeared, which allow the citation patterns of academic papers to be studied with unprecedented speed and ease,” the paper’s authors wrote.

Academics always need something to argue about, no matter how minuscule the topic. This particular article concludes on the note that someone should get the numbers straight so academics can move on to another item to argue about. Going back to the original thought, a student forced to write an essay with citations probably also thought: the reason this stuff does not get read is that it is so boring.

Whitney Grace, November 23, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Exit Shakespeare, for He Had a Coauthor

November 22, 2016

Shakespeare is regarded as the greatest writer in the English language. Many studies, however, are devoted to the theory that he did not pen all of his plays and poems. Some attribute them to Francis Bacon, Edward de Vere, Christopher Marlowe, and others. Whether Shakespeare was a singular author or one of many, two facts remain: he was a dirty old man, and it could be said he plagiarized his ideas from other writers. Shall he still be regarded as the figurehead for English literature?

Philly.com takes up the Shakespeare authorship question in the article, “Penn Engineers Use Big Data To Show Shakespeare Had Coauthor On ‘Henry VI’ Plays.” Editors of a new edition of Shakespeare’s complete works listed Marlowe as a coauthor on the Henry VI plays due to a recent study at the University of Pennsylvania. Alejandro Ribeiro realized his experience researching networks could be applied to the Shakespeare authorship question using big data.

Ribeiro learned that Henry VI was among the works for which scholars thought Shakespeare might have had a co-author, so he and lab members Santiago Segarra and Mark Eisen tackled the question with the tools of big data.  Working with Shakespeare expert Gabriel Egan of De Montfort University in Leicester, England, they analyzed the proximity of certain target words in the playwright’s works, developing a statistical fingerprint that could be compared with those of other authors from his era.

Two other research groups reached the same conclusion using other analytical techniques. The results from all three studies were enough to convince Gary Taylor, lead general editor of the New Oxford Shakespeare, who decided to list Marlowe as a coauthor of Henry VI. More research has been conducted to determine other potential Shakespeare coauthors, and six more writers will also be credited in the New Oxford editions.

Ribeiro and his team created “word-adjacency networks” that captured patterns in the writing styles of Shakespeare and six other dramatists. They discovered that many scenes in Henry VI were not written in Shakespeare’s style, enough to point to a coauthor.
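The researchers’ actual pipeline is more sophisticated, but the core idea behind a word-adjacency network can be sketched in a few lines: count how often selected function words appear near one another, normalize the counts into a stylistic profile, and measure how far two texts’ profiles diverge. The sketch below is my illustration of that idea only; the word list, window size, and distance measure are assumptions, not the Penn team’s choices.

```python
# A minimal sketch of the word-adjacency-network idea: tally how often chosen
# function words appear within a short window of one another, turn the tallies
# into transition probabilities, and compare texts by the distance between
# their matrices. Illustrative only; not the Penn team's code.
import re
import numpy as np

FUNCTION_WORDS = ["the", "and", "to", "of", "a", "in", "that", "with", "for", "but"]
WINDOW = 5  # how far ahead to look for a co-occurring function word (assumption)

def adjacency_profile(text: str) -> np.ndarray:
    words = re.findall(r"[a-z']+", text.lower())
    idx = {w: i for i, w in enumerate(FUNCTION_WORDS)}
    counts = np.zeros((len(FUNCTION_WORDS), len(FUNCTION_WORDS)))
    for i, w in enumerate(words):
        if w not in idx:
            continue
        for v in words[i + 1 : i + 1 + WINDOW]:
            if v in idx:
                counts[idx[w], idx[v]] += 1
    # Normalize each row into probabilities: a crude stylistic fingerprint.
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

def style_distance(text_a: str, text_b: str) -> float:
    # Smaller distance suggests more similar habits of function-word placement.
    return float(np.abs(adjacency_profile(text_a) - adjacency_profile(text_b)).sum())
```

In use, one would compare a disputed scene’s profile against profiles built from securely attributed samples of each candidate dramatist and see whose fingerprint sits closest.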

Some Shakespeare purists remain against the theory that Shakespeare did not pen all of his plays, but big data analytics supports many of the theories that academics have advanced for generations. The dirty old man was not alone as he wrote his ditties.

Whitney Grace, November 22, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

Hacking the Internet of Things

November 17, 2016

Readers may recall October’s DDoS attack against internet-performance-management firm Dyn, which disrupted web traffic at popular sites like Twitter, Netflix, Reddit, and Etsy. As it turns out, the growing Internet of Things (IoT) facilitated that attack; specifically, thousands of cameras and DVRs were hacked and used to bombard Dyn with page requests. CNet examines the issue of hacking through the IoT in “Search Engine Shodan Knows Where Your Toaster Lives.”

Reporter Laura Hautala informs us that it is quite easy for those who know what they’re doing to access any and all internet-connected devices. Skilled hackers can do so using search engines like Google or Bing, she tells us, but tools created for white-hat researchers, like Shodan, make the task even easier. Hautala writes:

While it’s possible hackers used Shodan, Google or Bing to locate the cameras and DVRs they compromised for the attack, they also could have done it with tools available in shady hacker circles. But without these legit, legal search tools, white hat researchers would have a harder time finding vulnerable systems connected to the internet. That could keep cybersecurity workers in a company’s IT department from checking which of its devices are leaking sensitive data onto the internet, for example, or have a known vulnerability that could let hackers in.

Even though sites like Shodan might leave you feeling exposed, security experts say the good guys need to be able to see as much as the bad guys can in order to be effective.

Indeed. Like every tool ever invented, the impacts of Shodan depend on the intentions of the people using it.
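The write up does not show what such a lookup looks like. As a rough illustration only, here is a minimal sketch using the official Shodan Python library; the API key, query, and network range below are placeholders, and this sort of inventory check should only be pointed at devices one is authorized to audit.

```python
# A minimal sketch of a white-hat device inventory check with the Shodan Python
# library (pip install shodan). The key, query, and net range are placeholders.
import shodan

API_KEY = "YOUR-SHODAN-API-KEY"  # hypothetical; use your own account's key
api = shodan.Shodan(API_KEY)

try:
    # Example: publicly reachable DVR interfaces inside a specific address
    # range (the net: filter confines results to that block).
    results = api.search('product:"DVR" net:203.0.113.0/24')
    print("Devices found:", results["total"])
    for match in results["matches"]:
        print(match["ip_str"], match.get("port"), match.get("org"))
except shodan.APIError as exc:
    print("Shodan error:", exc)
```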

Cynthia Murrell, November 17, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Black-Hat SEO Tactics Google Hates

November 16, 2016

The article on Search Engine Watch titled “Guide to Black Hat SEO: Which Practices Will Earn You a Manual Penalty?” follows up on a prior article that listed some of the sob stories of companies caught by Google using black-hat practices. Google does not take kindly to such activities, strangely enough. This article goes through some of those practices, which are meant to “falsely manipulate a website’s search position.”

Any kind of scheme where links are bought and sold is frowned upon, however money doesn’t necessarily have to change hands… Be aware of anyone asking to swap links, particularly if both sites operate in completely different niches. Also stay away from any automated software that creates links to your site. If you have guest bloggers on your site, it’s a good idea to automatically Nofollow any links in their blog signature, as this can be seen as a ‘link trade’.

Other practices that earned a place on the list include automatically generated content, cloaking and irrelevant redirects, and hidden text and links. Doorway pages are multiple pages for a key phrase that lead visitors to the same end destination. If you think these activities don’t sound so terrible, you are in great company. Mozilla, BMW, and the BBC have all been caught and punished by Google for such tactics. Good or bad? You decide.
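That last bit of quoted advice about guest bloggers’ signature links is easy to automate. Below is a minimal sketch using BeautifulSoup; the “author-bio” class name is a hypothetical example of how a site might mark up its signature block, so adjust the selector to your own templates.

```python
# A minimal sketch of adding rel="nofollow" to links inside a guest post's
# signature block, per the advice quoted above. Requires beautifulsoup4;
# the ".author-bio" selector is a hypothetical placeholder.
from bs4 import BeautifulSoup

def nofollow_signature_links(post_html: str) -> str:
    soup = BeautifulSoup(post_html, "html.parser")
    for bio in soup.select(".author-bio"):        # hypothetical signature wrapper
        for link in bio.find_all("a", href=True):
            rel = set(link.get("rel", []))        # rel is a multi-valued attribute
            rel.add("nofollow")
            link["rel"] = sorted(rel)
    return str(soup)

# Usage:
html = '<p>Great post.</p><div class="author-bio">By Pat: <a href="https://example.com">my site</a></div>'
print(nofollow_signature_links(html))
```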

Chelsea Kerwin, November 16, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Azure Search Overview

November 15, 2016

I know that Microsoft is a world leader in search and retrieval. Look at the company’s purchase of Fast Search & Transfer in 2008. Look at the search in Windows 7, 8, and 10. Look at the Microsoft research postings listed in Bing. I am convinced.

I did learn a bit more about Azure Search in “Microsoft Azure Search and Azure Backup Arrive in Canada.” I learned that search is now a service; for example:

Azure Search is Microsoft’s search-as-a-service solution for the cloud. It allows customers to add search to their applications using a REST API or the .NET SDK. Microsoft handles the server and infrastructure management, meaning developers don’t need to worry about understanding search.

Here are the features I noted from the write up:

  • Query syntax including Boolean and Lucene conventions (see the sketch after this list)
  • Support for 56 different languages
  • Search suggestions for autocomplete
  • Hit highlighting
  • Geospatial support
  • Faceted navigation, just like Endeca in 1998
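For the curious, here is a minimal sketch of what a query against the REST service mentioned above might look like, exercising the Lucene syntax, a facet, and hit highlighting. The service name, index name, and key are placeholders, and the api-version reflects the service as documented around this time, so details may differ from an actual deployment.

```python
# A minimal sketch of querying Azure Search over its REST API. Service name,
# index name, and key are placeholders; the api-version is the one current
# around late 2016 and may need updating.
import requests

SERVICE = "acme-search"        # hypothetical search service name
INDEX = "products"             # hypothetical index name
API_KEY = "YOUR-QUERY-KEY"     # query key from the Azure portal

url = f"https://{SERVICE}.search.windows.net/indexes/{INDEX}/docs"
params = {
    "api-version": "2016-09-01",
    "search": "laptop AND NOT refurbished",  # Boolean/Lucene-style query
    "queryType": "full",                     # enable the full Lucene syntax
    "facet": "brand",                        # faceted navigation counts
    "highlight": "description",              # hit highlighting
    "$top": 10,
}
resp = requests.get(url, params=params, headers={"api-key": API_KEY})
resp.raise_for_status()
for doc in resp.json().get("value", []):
    print(doc)
```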

The most interesting statement in the write up was, in my opinion:

Microsoft handles the server and infrastructure management, meaning developers don’t need to worry about understanding search.

I love that one does not need to understand search. That’s what makes search so darned fascinating today. Systems which require no understanding. I also believe everything that a search system presents in a list of relevance ranked results. I really do. I, for example, believed that Fast Search & Transfer was the most wonderful search system in the world until, well, the investigators arrived. Azure is even more wonderful as a cloud appliance thing that developers do not need to understand. Great and wonderful.

Stephen E Arnold, November 15, 2016

The House Cleaning of Halevy Dataspace: A Web Curiosity

November 14, 2016

I am preparing three seven-minute videos. That effort will be one video each week starting on 20 December 2016. The subject is my Google Trilogy, published by an antique outfit which has drowned in the River Avon. The first video is about the 2004 monograph, The Google Legacy. I coined the term “Googzilla” in that 230-page discussion of how baby Google became Googzilla. The second video summarizes several of the takeaways from Google: The Calculating Predator, published in 2007. The key to the monograph is the bound phrase “calculating predator.” Yep, not the happy little search outfit most know and love. The third video hits the main points of Google: The Digital Gutenberg, published in 2009. The idea is that Google spits out more digital content than almost anyone. Few think of the GOOG as the content generator the company has become. Yep, a map is a digital artifact.

Now to the curiosity. I wanted to reference the work of Dr. Alon Halevy, a former University of Washington professor and founder of Nimble and Transformic. I had a stack of links I used when I was doing the research for my predator book. Just out of curiosity I started following the links. I do have PDF versions of most of the open source Halevy-centric content I located.

But guess what?

Dr. Alon Halevy has disappeared. I could not locate the open source version of his talk about dataspaces. I could not locate the Wayback Machine’s archived version of the Transformic.com Web site. The links returned these weird 404 errors. My assumption was that Wayback’s Web pages resided happily on the outfit’s servers. I was incorrect. Here’s what I saw:

[Screenshot: Wayback Machine 404 error for the archived Transformic.com pages]

I explored the bound phrase “Alon Halevy” with various other terms only to learn that the bulk of the information has disappeared. No PowerPoints, not much substantive information. There were a few “information objects” which have not yet disappeared; for example:

  • An ACM blog post which references “the structured data team” and Nimble and Transformic
  • A Google research paper which will not make those who buy into David Gelernter’s The Tides of Mind thesis happy
  • A YouTube video of a lecture given at Technion.

I found the gap between the research I gathered from 2005 to 2007 and what remains online interesting. I asked myself, “How did I end up with so many dead links about a technology I have described as one of the most important in database, data management, data analysis, and information retrieval?”

Here are the answers I formulated:

  1. The Web is a lousy source of information. Stuff just disappears, like the Darpa listing of open source Dark Web software, blogs, and Web sites.
  2. I did really terrible research and even worse librarian-type behavior. Yep, mea culpa.
  3. Some filtering procedures became a bit too aggressive, and the information has been swept from assorted indexes.
  4. The Wayback Machine ran off the rails and pointed to an actual 2005 Web site which its system failed to copy when the original spidering was completed.
  5. Gremlins. Hey, they really do exist. Just ask Grace Hopper. Yikes, she’s not available.

I wanted to mention this apparent or erroneous scrubbing. The story in this week’s HonkinNews video points out that 89 percent of journalists do their research via Google. Now if information is not in Google, what does that imply for a “real” journalist trying to do an objective, comprehensive story? I leave it up to you, gentle reader, to penetrate this curiosity.

Watch for the Google Trilogy seven-minute videos on December 20, 2016, December 27, 2016, and January 3, 2017. Free. No pay wall. No Patreon.com pleading. No registration form. Just honkin’ news seven days a week and some video shot on an old Bell+Howell camera in a log cabin in rural Kentucky.

Stephen E Arnold, November 14, 2016

The Tor Project Releases the Browser Manual

November 14, 2016

Tor Browser, the gateway to the Dark Web, now has a user manual that gives users step-by-step procedures for downloading, installing, using, and uninstalling the browser in the most efficient manner.

The official Tor blog post titled “Announcing the Tor Browser User Manual” says:

The community team is excited to announce the new Tor Browser User Manual! The manual is currently only available in English. We will be adding more languages in the near future, as well as adding the manual to Transifex.

Web users are increasingly adopting secure browsers like Tor that shield them from online tracking. With this manual, users who are not well versed in the Dark Web but want to access it, or who simply want to surf the web anonymously, will get detailed instructions for doing so.

Some of the critical areas (apart from basic instructions like downloading and installing) covered in the manual include circumventing network restrictions, managing identities, connecting to Tor securely, managing plugins, and troubleshooting the most common problems.
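The manual covers those steps inside the browser itself. For readers who script against Tor rather than browse, a common pattern, offered here as my own illustration and not as part of the manual, is to route traffic through the local SOCKS proxy and confirm the circuit with the Tor Project’s check service.

```python
# A minimal sketch (not from the manual) of sending a script's traffic through
# a locally running Tor client via its SOCKS proxy. Requires requests with
# SOCKS support (pip install "requests[socks]"). Port 9050 is the usual tor
# daemon default; Tor Browser typically listens on 9150 instead.
import requests

TOR_PROXY = "socks5h://127.0.0.1:9050"  # socks5h resolves DNS through Tor too
proxies = {"http": TOR_PROXY, "https": TOR_PROXY}

# check.torproject.org reports whether the request actually arrived via Tor.
resp = requests.get("https://check.torproject.org/api/ip", proxies=proxies, timeout=30)
print(resp.json())  # expect something like {"IsTor": true, "IP": "<exit node>"}
```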

The manual was created after taking feedback from various mailing lists and IRC forums, as the blog points out:

During the creation of this manual, community feedback was requested over various mailing lists / IRC channels. We understand that many people who read this blog are not part of these lists / channels, so we would like to request that if you find errors in the manual or have feedback about how it could be improved, please open a ticket on our bug tracker and set the component to “community”.

The manual will soon be released in other major languages, which will benefit non-English speaking users. The aim is to foster growth and adoption of Tor; however, will only privacy-conscious users be using the browser?

Vishal Ingole, November 14, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Why Search When You Can Discover

November 11, 2016

What’s next in search? My answer is, “No search at all. The system thinks for you.” Sounds like Utopia for the intellectual couch potato to me.

I read “The Latest in Search: New Services in the Content Discovery Marketplace.” The main point of the write up is to highlight three “discovery” services. A discovery service is one which offers “information users new avenues to the research literature.”

See, no search needed.

The three services highlighted are:

  • Yewno, which is powered by an inference engine. (Does anyone remember the Inference search engine from days gone by?). The Yewno system uses “computational analysis and a concept map.” The problem is that it “supplements institutional discovery.” I don’t know what “institutional discovery” means, and my hunch is that folks living outside of rural Kentucky know what “institutional discovery” means. Sorry to be so ignorant.
  • ScienceOpen, which delivers a service which “complements open Web discovery.” Okay. I assume that this means I run an old fashioned query and ScienceOpen helps me out.
  • TrendMD, which “serves as a classic ‘onward journey tool’ that aims to generate relevant recommendations serendipitously.”

I am okay with the notion of having tools to make it easier to locate information germane to a specific query. I am definitely happy with tools which can illustrate connections via concept maps, link analysis, and similar outputs. I understand that lawyers want to type in a phrase like “Panama deal” and get a set of documents related to this term so the mass of data can be chopped down by sender, recipient, time, etc.

But setting up discovery as a separate operation from keyword or entity based search seems a bit forced to me. The write up spins its lawn mower blades over the TrendMD service. That’s fine, but there are a number of ways to explore scientific, technical, and medical literature. Some are or were delightful like Grateful Med; others are less well known; for example, Mednar and Quertle.

Discovery means one thing to lawyers. It means another thing to me: a search add-on.

Stephen E Arnold, November 11, 2016

