Tor Anonymity Not 100 Percent Guaranteed

January 1, 2017

An article at Naked Security reveals some information turned up by innovative Tor-exploring hidden services in its article, “‘Honey Onions’ Probe the Dark Web: At Least 3% of Tor Nodes are Rogues.” By “rogues,” writer Paul Ducklin is referring to nodes, run by criminals and law enforcement alike, that are able to track users through Tor entry and/or exit nodes. The article nicely lays out how this small fraction of nodes can capture IP addresses, so see the article for that explanation. As Ducklin notes, three percent is a small enough window that someone just wishing to avoid having their shopping research tracked may remain unconcerned, but it is a bigger matter for, say, a journalist investigating events in a war-torn nation. He writes:

Two researchers from Northeastern University in Boston, Massachusetts, recently tried to measure just how many rogue HSDir nodes there might be, out of the 3000 or more scattered around the world. Detecting that there are rogue nodes is fairly easy: publish a hidden service, tell no one about it except a minimum set of HSDir nodes, and wait for web requests to come in.[…]

With 1500 specially-created hidden services, amusingly called ‘Honey Onions,’ or just Honions, deployed over about two months, the researchers measured 40,000 requests that they assume came from one or more rogue nodes. (Only HSDir nodes ever knew the name of each Honion, so the researchers could assume that all connections must have been initiated by a rogue node.) Thanks to some clever mathematics about who knew what about which Honions at what time, they calculated that these rogue requests came from at least 110 different HSDir nodes in the Tor network.
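The “clever mathematics about who knew what about which Honions” is, at its core, a set-cover calculation: each Honion’s address was disclosed only to a handful of HSDir nodes, so every unsolicited visit implicates at least one node in that handful. A minimal sketch of the idea, using a greedy approximation and invented node names rather than the researchers’ exact method:

```python
# Greedy set-cover sketch: estimate the minimum number of rogue HSDir
# nodes needed to explain all observed Honion visits. The mapping of
# Honion name -> {HSDir nodes that knew its address} is hypothetical.

def min_rogue_nodes(visited_honions, honion_to_hsdirs):
    uncovered = set(visited_honions)
    rogues = set()
    while uncovered:
        # Pick the node that explains the most still-unexplained visits.
        best = max(
            {n for h in uncovered for n in honion_to_hsdirs[h]},
            key=lambda n: sum(1 for h in uncovered if n in honion_to_hsdirs[h]),
        )
        rogues.add(best)
        uncovered = {h for h in uncovered if best not in honion_to_hsdirs[h]}
    return rogues

honion_to_hsdirs = {
    "honion1": {"nodeA", "nodeB"},
    "honion2": {"nodeB", "nodeC"},
    "honion3": {"nodeD", "nodeE"},
}
# Visits arrived at all three Honions: nodeB alone can explain the first
# two, so at least two rogue nodes are required in total.
print(len(min_rogue_nodes(["honion1", "honion2", "honion3"], honion_to_hsdirs)))
```

Minimizing the number of nodes consistent with every visit yields a conservative floor, which is why the researchers phrase their finding as “at least 110” rogue HSDirs.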

It is worth noting that many of those requests were simple pings, but others were actively seeking vulnerabilities. So, if you are doing anything more sensitive than comparing furniture prices, you’ll have to decide whether you want to take that three percent risk. Ducklin concludes by recommending added security measures for anyone concerned.

Cynthia Murrell, January 1, 2017

Connexica (Formerly Ardentia NetSearch) Embraces Business Analytics

December 31, 2016

You may remember Ardentia NetSearch. The company’s original product was NetSearch, which was designed to be quick to deploy and aimed at the end user, not the information technology department. The company changed its name to Connexica in 2001. I checked the company’s Web site and noted that the company positions itself this way:

Our mission is to turn smart data discovery into actionable information for everyone.

What’s interesting is that Connexica asserts that

“search engine technology is the simplest and fastest way for users to service their own information needs.”

The idea is that if one can use Google, one can use Connexica’s systems. A brief description of the company states:

Connexica is the world’s pioneer of search based analytics.

The company offers Cxair. This is a Java based Web application. The application provides search engine based data discovery. The idea is that Cxair permits “fast, effective and agile business analytics.” What struck me was the assertion that Cxair is usable with “poor quality data.” The idea is to create reports without having to know the formal query syntax of SQL.
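The “if one can use Google, one can use it” pitch boils down to matching free-text terms against field values instead of composing SQL. A toy illustration (the records and matching rules are invented; Cxair’s actual engine is proprietary):

```python
# Toy illustration of keyword-style data discovery versus a formal SQL
# query. The records and matching rules are invented for illustration;
# they are not Cxair's engine.
records = [
    {"dept": "cardiology", "year": 2016, "spend": 120_000},
    {"dept": "oncology",   "year": 2016, "spend": 250_000},
    {"dept": "cardiology", "year": 2015, "spend": 90_000},
]

def keyword_search(query, rows):
    """Match every record whose field values contain all query terms."""
    terms = query.lower().split()
    return [
        r for r in rows
        if all(any(t in str(v).lower() for v in r.values()) for t in terms)
    ]

# "cardiology 2016" stands in for:
#   SELECT * FROM spend WHERE dept = 'cardiology' AND year = 2016;
print(keyword_search("cardiology 2016", records))
```

A real system would add tokenization, synonyms, and relevance ranking, but the contrast with the commented-out SQL captures the idea.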

The company’s MetaVision produce is a Java based Web application that “interrogates database metadata.” The idea, as I understand it, is to use MetaVision to help migrate data into Hadoop, Cxair, or ElasticSearch.

Connexica, partly funded by Midven, is a privately held company based in the UK. The firm has more than 200 customers and more than 30 employees. When updating my files, I noted that ZoomInfo reports that the firm was founded in 2006, but that conflicts with my file data, which show the company operating as early as 2001.

A quick review of the company’s information on its Web site and open sources suggests that the firm is focusing its sales and marketing efforts on health care, finance, and government customers.

Connexica is another search vendor which has performed a successful pivot. Search technology is secondary to the company’s other applications.

Stephen E Arnold, December 31, 2016

Study of Search: Weird Results Plus Bonus Errors

December 30, 2016

I was able to snag a copy of “Indexing and Search: A Peek into What Real Users Think.” The study appeared in October 2016, and it appears to be the work of IT Central Station, an outfit described as a source of “unbiased reviews from the tech community.” I thought, “Oh, oh, ‘real users.’ A survey.” An IDC-type or Gartner-type sample, which, although suspicious to me, seems to convey some useful information when the moon is huge. Nope. Unbiased? Nope.

Note that the report is free. One can argue that free does not translate to accurate, high value, somewhat useful information. I support this argument.

The report, like many of the “real” reports I have reviewed over the decades is relatively harmless. In terms of today’s content payloads, the study fires blanks. Let’s take a look at some of the results, and you can work through the 16 pages to double check my critique.

First, who are the “top” vendors? This list reveals quite a bit about the basic flaw in the “peek.” The table below presents the list of “top” vendors along with my comment about each vendor. Companies with open source Lucene/Solr based systems are in dark red. Companies or brands which have retired from the playing field in professional search are in bold gray.

Vendor: Comment
Apache: This is not a search system. It is an open source umbrella for projects, of which Lucene and Solr are two among many
Attivio: Based on Lucene/Solr open source search software; positioned as a business intelligence vendor
Copernic: A desktop search and research system based on proprietary technology from the outfit known as Coveo
Coveo: A vendor of proprietary search technology now chasing Big Data and customer support
Dassault Systèmes: Owns Exalead, which is now downgraded to a utility within Dassault’s PLM software
Data Design, now Ryft.com: Pitches search without indexing via a proprietary “circuit module” method
Data Gravity: Search is a utility in a storage centric system
DieselPoint: Company has been “quiet” for a number of years
Expert System: Publicly traded and revenue challenged vendor of a metadata utility, not a search system
Fabasoft: Mindbreeze is a proprietary replacement for SharePoint search
Google: Discontinued the Google Search Appliance and exited enterprise search
Hewlett Packard Enterprise: Sold its search technology to Micro Focus; legal dispute in progress over alleged fraud
IBM OmniFind: Lucene and proprietary scripts plus acquired technology
IBM StoredIQ: Like DB2 search, a proprietary utility
ISYS Search Software: Now owned by Lexmark and marginalized due to alleged revenue shortfalls
Lookeen: Lucene based desktop and Outlook search
Lucidworks: Solr add ons with floundering efforts to be more than enterprise search
MAANA: Proprietary search optimized for Big Data
Microsoft: Offers multiple search solutions. The most notorious are Bing and the Fast Search & Transfer proprietary solutions
Oracle: Full text search is a utility for Oracle licensees; owns Artificial Linguistics, Triple Hop, Endeca, RightNow, InQuira, and the marginalized Secure Enterprise Search. Oh, don’t forget command line querying via PL/SQL
Polyspot, now CustomerMatrix: Now a customer service vendor
Siderean Software: Went out of business in 2008; a semantic search outfit
Sinequa: Now a Big Data outfit with hopes of becoming the “next big thing” in whatever sells
X1 Search: An eternal start up pitching eDiscovery and desktop search with a wild and crazy interface

What does the table tell us about “top” systems? First, the list includes vendors not directly in the search and retrieval business. There is no differentiation among the vendors repackaging and reselling open source Lucene/Solr solutions. The listing is a fruitcake of desktop, database, and unstructured search systems. In short, the word “top” does not do the trick for me. I prefer “a list of eclectic and mostly unknown systems which include a search function.”

The report presents 10 bar charts which tell me absolutely nothing about search and retrieval. The bars appear to be a popularity contest based on visits to the author’s Web site. Only two of the search systems listed in the bar chart have “reviews.” Autonomy IDOL garnered three reviews and Lookeen one review. The other eight vendors’ products were not reviewed. Autonomy and Lookeen could not be more different in purpose, design, and features.

The report then tackles the “top five” search systems in terms of clicks on the author’s Web site. Yep, clicks. That’s a heck of a yardstick: what percentage of the clicks came from humans, and what percentage was bot driven? No answer, of course.

The most popular “solutions” illustrate the weirdness of the sample. The number one solution is DataGravity, which is a data management system with various features and utilities. The next four “top” solutions are:

  • Oracle Endeca – eCommerce and business intelligence and whatever Oracle can use the ageing system for
  • The Google Search Appliance – discontinued with a cloud solution coming down the pike, sort of
  • Lucene – open source, the engine behind Elasticsearch, which is quite remarkably not on the list of vendors
  • Microsoft Fast Search – included in SharePoint to the delight of the integrators who charge to make the dog heel once in a while.

I find it fascinating that DataGravity (1,273) garnered more than three times the “votes” of Microsoft Fast Search (404). I think there are more than 200 million SharePoint licensees. Many of these outfits have many questions about Fast Search. I would hazard a guess that DataGravity has a tiny fraction of the SharePoint installed base, and its brand identity and company name recognition are a fraction of Microsoft’s. Weird data, meaningless data, or both.

The bulk of the report consists of comparisons of various search engines. I could not figure out the logic of the comparisons. What, for example, do Lookeen and IBM StoredIQ have in common? Answer: zero.

The search report strikes me as a bit of silliness. The report may be an anti sales document. But your mileage will differ. If it does, good luck to you.

Stephen E Arnold, December 30, 2016

DataFission: Is It a Dusie?

December 26, 2016

I know that some millennials are not familiar with the Duesenberg automobile. Why would that generation care about an automobile manufacturer that went out of business in 1937? My thought is that the Duesenberg left one nifty artifact: the word doozy, which means something outstanding.


I thought of the Duesenberg “doozy” when I read “Unstructured Data Search Engine Has Roots in HPC.” HPC means high performance computing. The acronym suggests a massively parallel system just like the one to which the average mobile phone user has access. The name of the search engine is “Duse,” which here in Harrod’s Creek is pronounced “doozy.”

According to the write up:

One company hoping to tap into the morass of unstructured data is DataFission. The San Jose, California firm was founded in 2013 with the goal of productizing a scale-out search engine, called the Digital Universe Search Engine, or DUSE, that it claims can index just about any piece of data, and make it searchable from any Web-enabled device.

The key to Duse is pattern matching. This is a pretty good method; for example, Brainware used trigrams to power its search system. Since the company disappeared into Lexmark, I am not sure what happened to the company’s system. I think the n-gram patent is owned by a bank located near an abandoned Kodak facility.

The method of the system, as I understand it, is:

  1. Index content
  2. Put index into compressed tables
  3. Allow users to search the index.

The users can “search” by entering queries or dragging “images, videos, or audio files into Duse’s search bar or programmatically via REST APIs.”

What differentiates Duse? The write up states:

The secret sauce lies in how the company indexes the data. A combination of machine learning techniques, such as principal component analysis (PCA), clustering, and classification algorithms, as well as graph link analysis and “nearest neighbor” approach help to find associations in the data.

Dr. Harold Trease, the architect of the Duse system, says:

We generate a high-dimensional signature, a high-dimensional feature vector, that quantifies the information content of the data that we read through,” he says. “We’re not looking for features like dogs or cats or buildings or cars. We’re quantifying the information content related to the data that we read. That’s what we index and put in a database. Then if you pull out a cell phone and take a picture of the dog, we convert that to one of these high-dimensional signatures, and then we compare that to what’s in the database and we find the best matches.

He adds:

If we index a billion images, we’d end up with a billion points in this search space, and we can look at that search space; it has structure to it, and the structure is fantastic. There’s all kinds of points and clusters and strands that connect things. It makes a little less sense to humans, because we don’t see things like that. But to the code, it makes perfect sense.
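The workflow Dr. Trease describes, reducing each item to a high-dimensional vector, indexing the vectors, and answering queries by nearest neighbor, can be sketched generically. The toy byte-histogram “signature” below is invented for illustration and bears no resemblance to DataFission’s actual feature extraction:

```python
import numpy as np

# Generic sketch of "high-dimensional signature" matching: reduce each
# item to a feature vector, index the vectors, and answer queries by
# nearest neighbor. Illustrative only; not DataFission's actual method.

def signature(raw_bytes, dim=8):
    """Toy signature: histogram of byte values folded into `dim` buckets."""
    counts = np.zeros(dim)
    for b in raw_bytes:
        counts[b % dim] += 1
    norm = np.linalg.norm(counts)
    return counts / norm if norm else counts

# "Index" a tiny collection by storing each item's signature.
index = {name: signature(data) for name, data in {
    "dog_photo": b"dog dog woof",
    "cat_photo": b"cat cat meow",
}.items()}

def best_match(query_bytes):
    q = signature(query_bytes)
    # Cosine similarity: vectors are unit-normalized, so a dot product suffices.
    return max(index, key=lambda name: float(index[name] @ q))

print(best_match(b"dog woof woof"))
```

The same pattern scales up with a proper feature extractor and an approximate nearest-neighbor index in place of the brute-force `max`.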

The underlying technology dates from the 1990s, when the search method was part of medical image analysis and related research.

The write up reports:

The software itself, which today exists as a Python-based Apache Spark application, can be obtained as a software product or fully configured on a hardware appliance called DataHunter.

For more information about the company, navigate to this link.

Stephen E Arnold, December 26, 2016

Lucidworks Sees Watson as a Savior

December 21, 2016

Lucidworks (really?). A vision has appeared to the senior managers of Lucidworks, an open source search outfit which has ingested $53 million and sucked in another $6 million in debt financing in June 2016. Yep, that Lucidworks. The “really” which the name invokes is an association I form when someone tells me that commercializing open source search is going to knock off the pesky Elastic of Elasticsearch fame while returning a juicy payoff to the folks who coughed up the funds to keep the company founded in 2007 chugging along. Yep, Lucid works. Sort of, maybe.

I read “Lucidworks Integrates IBM Watson into Fusion Enterprise Discovery Platform.” The write up explains that Lucidworks is “tapping into” the IBM Watson developer cloud. The write up explains that Lucidworks has:

an application framework that helps developers to create enterprise discovery applications so companies can understand their data and take action on insights.

Ah, so many buzzwords. Search has become applications. “Action on insights” puts some metaphorical meat on the bones of Solr, the marrow of Lucidworks. Really?

With Watson in the company’s back pocket, Lucidworks will deliver. I learned:

Customers can rely on Fusion to develop and deploy powerful discovery apps quickly thanks to its advanced cognitive computing features and machine learning from Watson. Fusion applies Watson’s machine learning capabilities to an organization’s unique and proprietary mix of structured and unstructured data so each app gets smarter over time by learning to deliver better answers to users with each query. Fusion also integrates several Watson services such as Retrieve and Rank, Speech to Text, Natural Language Classifier, and AlchemyLanguage to bolster the platform’s performance by making it easier to interact naturally with the platform and improving the relevance of query results for enterprise users.

But wait. Doesn’t Watson perform these functions already? And if Watson comes up a bit short in one area, isn’t IBM-infused Yippy ready to take up the slack?

That question is not addressed in the write up. It seems that the differences between Watson, its current collection of partners, and affiliated entities like Yippy are vast. The write up tells me:

customers looking for hosted, pre-tuned machine learning and natural language processing capabilities can point and click their way to building sophisticated applications without the need for additional resources. By bringing Watson’s cognitive computing technology to the world of enterprise data apps, these discovery apps made with Fusion are helping professionals understand the mountain of data they work with in context to take action.

This sounds like quite a bit of integration work. Lucidworks. Really?

Stephen E Arnold, December 21, 2016

Creativity for Search Vendors

December 18, 2016

If you scan the marketing collateral from now defunct search giants like Convera, DR LINK, Fulcrum Technologies or similar extinct beasties, you will notice a similarity of features and functions. Let’s face it. Search and retrieval has been stuck in the mud for decades. Some wizards point to the revolution of voice search, emoji based queries, and smart software which knows what you want before you know you need some information.

Typing key words, indexing systems which add concept labels, and shouting at a mobile phone whilst standing between cars on a speeding train returns semi-useful links to what amount to homework: Open link, scan for needed info, close link, and do it again.


Eureka, California is easy to find. Get inspired.

Now there is a solution to search and content processing vendors’ inability to be creative. These methods appear to fuel the fanciful flights of fancy emanating from predictive analytics, Big Data, and semantic search companies.

Navigate to “8 Tried-and-Tested Ways to Unlock Your Creativity.” Now you too can emulate the breakthroughs, insights, and juxtapositions of Leonardo, Einstein, Mozart, and, of course, Facebook’s design team.

Let’s take a look at these eight ideas.

  1. Set up a moodboard. I have zero idea what a moodboard is. I am not sure it would fit into the work methods of Beethoven. He seemed a bit volatile and prone to “bad” moods.
  2. Talk it out. That’s a great idea for companies engaged in classified projects for nation states. Why not have those conversations in a coffee shop or, better yet, on an airplane with strangers sitting cheek by jowl?
  3. Brainstorming. My recollection of brainstorming is that it can be fun, but without one person who doesn’t get with the program, the “ideas” are often like recycled plastic bottles. Not always, of course. But the donuts can be a motivator.
  4. Mindmapping. Yep, diagrams. These are helpful, particularly when equations are included for the home economics majors and failed webmasters who wrangle a job at a search or content processing vendor. What’s that pitchfork looking thing mean?
  5. Doodling. Works great. The use of paper and pencils is popular. One can use a Microsoft Surface or a giant iPad thing. Profilers and psychologists enjoy doodles. Venture capitalists who invested in a search and content processing company often sketch somewhat dark images.
  6. Music. Forget that Mozart and fighter pilot stuff. Go for Gregorian chants, heavy metal, and mindfulness tunes. Here in Harrod’s Creek, we love Muzak featuring the Whites and John Lomax.
  7. Lucid dreaming. This idea is popular among some of the visionaries working at high profile Sillycon Valley companies. Loon balloons, solar powered Internet aircraft, and trips to Mars. Apply that thinking to search and what do you get? Tay, search by sketch, and smart maps which identify pizza joints.
  8. Imagine what a great innovator would do. That works. People sitting on a sofa playing a video game can innovate between button pushes.

Why aren’t search and content processing vendors more creative? Now these folks can go in new directions armed with these tips and the same eight or nine algorithms in wide use. Peak search? Not by a country mile.

Stephen E Arnold, December 18, 2016

Use Google on Itself to Search Your Personal Gmail Account

December 16, 2016

The article titled 9 Secret Google Search Tricks on Field Guide includes a shortcut to checking on your current and recent deliveries, your flight plans, and your hotels. Google provides this information by pulling keywords from your Gmail account inbox. Perhaps the best one for convenience is searching “my bills” and being reminded of upcoming payments. Of course, this won’t work for bills that you receive via snail mail. The article explains,

Google is your portal to everything out there on the World Wide Web…but also your portal to more and more of your personal stuff, from the location of your phone to the location of your Amazon delivery. If you’re signed into the Google search page, and you use other Google services, here are nine search tricks worth knowing. It probably goes without saying but just in case: only you can see these results.

Yes, search is getting easier. Trust Mother Google. She will hold all your information in her hand, and you just need to ask for it. Other tricks include searching “I’ve lost my phone.” Google might not be Find My iPhone, but it can tell you the last place you had your phone, provided your phone is linked to your Google account. Hotels, events, photos: Google will have your back.

Chelsea Kerwin, December 16, 2016

Big Data Needs to Go Public

December 16, 2016

Big Data touches every part of our lives, and we are unaware. Have you ever noticed when you listen to the news, read an article, or watch a YouTube video that people use phrases such as “experts claim” or “science says”? In the past, these statements relied on less than trustworthy sources, but now they can use Big Data to back up their claims. However, popular opinion and puff pieces still need to back up their Big Data with hard fact. Nature.com says that transparency is a big deal for Big Data and that algorithm designers need to work on it in the article “More Accountability for Big-Data Algorithms.”

One of the hopes is that Big Data will be used to bridge the divide between one bias and another, except that the opposite can happen. In other words, Big Data algorithms can be designed with a bias:

There are many sources of bias in algorithms. One is the hard-coding of rules and use of data sets that already reflect common societal spin. Put bias in and get bias out. Spurious or dubious correlations are another pitfall. A widely cited example is the way in which hiring algorithms can give a person with a longer commute time a negative score, because data suggest that long commutes correlate with high staff turnover.
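The commute-time example can be made concrete. A score that includes a turnover-derived commute penalty downgrades an otherwise identical candidate; the weights below are invented purely for illustration:

```python
# Illustration of "bias in, bias out": a toy hiring score that includes a
# commute-time feature because historical data correlated long commutes
# with staff turnover. All weights are invented for illustration.

def hiring_score(years_experience, skills_match, commute_minutes):
    return (
        2.0 * years_experience
        + 5.0 * skills_match
        - 0.1 * commute_minutes   # the dubious, turnover-derived penalty
    )

# Two candidates identical on the job-relevant inputs; they differ only
# in where they can afford to live.
near = hiring_score(years_experience=5, skills_match=0.9, commute_minutes=15)
far = hiring_score(years_experience=5, skills_match=0.9, commute_minutes=75)
print(near, far)
```

The job-relevant inputs are identical; the entire gap between the two scores comes from the proxy feature.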

Even worse is that people and organizations can design an algorithm to support science or facts they want to pass off as the truth. There is a growing demand for “algorithm accountability,” mostly in academia. The demands are that data sets fed into the algorithms be made public. There are also plans to make algorithms that monitor algorithms for bias.

Big Data is here to stay, but relying too much on algorithms can distort the facts. This is why the human element is still needed to distinguish between fact and fiction. Minority Report is closer to being our present than ever before.

Whitney Grace, December 16, 2016

Costs of the Cloud

December 15, 2016

The cloud was supposed to save organizations a bundle on servers, but now we learn from Datamation that “Enterprises Struggle with Managing Cloud Costs.” The article cites a recent report from Dimensional Research and cloud-financial-management firm Cloud Cruiser, which tells us, for one thing, that 92 percent of organizations surveyed now use the cloud. Researchers polled 189 IT pros at Amazon Web Services (AWS) Global Summit in Chicago this past April, where they also found that 95 percent of respondents expect their cloud usage to expand over the next year.

However, organizations may wish to pause and reconsider their approach before throwing more money at cloud systems. Writer Pedro Hernandez reports:

Most organizations are suffering from a massive blind spot when it comes to budgeting for their public cloud services and making certain they are getting their money’s worth. Nearly a third of respondents said that they aren’t proactively managing cloud spend and usage, the study found. A whopping 82 percent said they encountered difficulties reconciling bills for cloud services with their finance departments.

‘The top challenge with the continuously growing public cloud resource is the ability to manage allocation usage and costs,’ stated the report. ‘IT and Finance continue to have difficulty working together to ascertain and allocate public cloud usage, and IT continues to struggle with technologies that will gather and track public cloud usage information.’ …

David Gehringer, principal at Dimensional Research, believes it’s time for enterprises to quit treating the cloud differently and adopt IT monitoring and cost-control measures similar to those used in their own data centers.
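In practice, the data-center-style discipline Gehringer describes starts with showback: tag every billed resource with a cost center and roll spend up so IT and Finance reconcile against the same numbers. A minimal sketch with invented billing records:

```python
from collections import defaultdict

# Minimal showback sketch: roll hypothetical cloud billing line items up
# by a cost-center tag so IT and Finance reconcile the same totals.
line_items = [
    {"service": "ec2", "cost": 410.0, "tags": {"cost_center": "dev"}},
    {"service": "s3", "cost": 95.0, "tags": {"cost_center": "analytics"}},
    {"service": "emr", "cost": 780.0, "tags": {"cost_center": "analytics"}},
    {"service": "ec2", "cost": 120.0, "tags": {}},  # untagged: unallocatable
]

def allocate(items):
    totals = defaultdict(float)
    for item in items:
        center = item["tags"].get("cost_center", "UNALLOCATED")
        totals[center] += item["cost"]
    return dict(totals)

print(allocate(line_items))
```

The untagged line item is the interesting output: surfacing an UNALLOCATED bucket is what exposes the “blind spot” the survey respondents describe.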

The report also found that top priorities for respondents included cost and reporting at 54 percent, performance management at 46 percent, and resource optimization at 45 percent. It also found that cloud demand is driven by application development and testing, at 59 percent, and big data/analytics at 31 percent.

The cloud is no longer a shiny new invention, but rather an integral part of most organizations. We would do well to approach its management and funding as we would any other resource. The original report is available, with registration, here.

Cynthia Murrell, December 15, 2016

On the Hunt for Thesauri

December 15, 2016

How do you create a taxonomy? These curated lists do not just write themselves, although they seem to do that these days.  Companies that specialize in file management and organization develop taxonomies.  Usually they offer customers an out-of-the-box option that can be individualized with additional words, categories, etc.  Taxonomies can be generalized lists, think of a one size fits all deal.  Certain industries, however, need specialized taxonomies that include words, phrases, and other jargon particular to that field.  Similar to the generalized taxonomies, there are canned industry specific taxonomies, except the more specialized the industry the less likely there is a canned list.

This is where taxonomy lists need to be created from scratch. Where do the taxonomy writers get the content for their lists? They turn to the tried, true resources that have aided researchers for generations: dictionaries, encyclopedias, and technical manuals. Thesauri are perhaps the most important tools for taxonomy writers, because they include not only words and their meanings, but also synonyms and antonyms for words within a field.
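The relationships a thesaurus contributes beyond a flat word list, synonyms plus broader, narrower, and related terms, map naturally onto a small data structure. A sketch with invented medical terms:

```python
# Sketch of the standard thesaurus relationships (broader term, narrower
# term, related term, synonyms) that feed a taxonomy. Terms are invented
# for illustration.
thesaurus = {
    "myocardial infarction": {
        "synonyms": ["heart attack", "MI"],
        "broader": ["cardiovascular disease"],
        "narrower": ["STEMI", "NSTEMI"],
        "related": ["angina"],
    },
}

def expand_query(term, entries):
    """Expand a search term with its synonyms, a common thesaurus use."""
    entry = entries.get(term, {})
    return [term] + entry.get("synonyms", [])

print(expand_query("myocardial infarction", thesaurus))
```

The broader/narrower links are what turn a word list into a taxonomy: follow them upward and you get the category hierarchy.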

If you need to write a taxonomy and are at a loss, check out MultiTes. It is a Web site that includes tools and other resources to get your taxonomy job done. Multisystems built MultiTes, and they:

…developed our first computer program for Thesaurus Management on PC’s in 1983, using dBase II under CP/M, predecessor of the DOS operating system. Today, more than three decades later, our products are as easy to install and use. In addition, with MultiTes Online all that is needed is a web connected device with a modern web browser.

In other words, they have experience and know their taxonomies.

Whitney Grace, December 15, 2016
