A Full Text Engine Blooms in Life
January 9, 2014
Basic search for static Web sites stinks. It is generic code that takes a one-size-fits-all approach to search, and as we all know, that never works. Stavros Korokithakis recognized this problem and decided he wanted to create a full-text search engine that was accurate. In his article, “Writing A Full-Text Search Engine Using Bloom Filters,” Korokithakis details how he wrote his own search using an inverted index and Bloom filters. An inverted index works by mapping every word in a document to the ID of that document. As one can imagine, that list grows very large, and a basic search engine for a static Web site returns every hit. A search plug-in limits itself to titles, tags, and keywords. How do you get the same results for a static search?
A Bloom filter is the answer. A Bloom filter is a probabilistic data structure that stores elements in a fixed number of bits and, when queried, tells users whether it has probably seen those elements before. It is also apparently easy to implement a Bloom filter:
- “Create one filter per document and add all the words in that document in the filter.
- Serialize the (fixed-size) filter in some sort of string and send it to the client.
- When the client needs to search, iterate through all the filters, looking for ones that match all the terms, and return the document names.
- Profit!”
He even has a quick implementation guide in Python. It sounds like a wonderful way to improve static Web site search, but could the same problem not be solved with a simple plug-in as described above? Many people rely on pre-made Web site platforms such as WordPress and Tumblr, which come with built-in plug-ins. Is this meant for the bigger Web sites people deploy themselves?
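The steps Korokithakis lists can be sketched in a few dozen lines of Python. This is a minimal toy, not his actual code; the filter size, the hash scheme, and the class and function names are all invented here for illustration:

```python
import hashlib

class BloomFilter:
    """A fixed-size bit array plus several hash functions. Membership
    tests can yield false positives but never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, word):
        # Derive num_hashes bit positions from slices of one digest.
        digest = hashlib.sha256(word.lower().encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos] = True

    def might_contain(self, word):
        return all(self.bits[pos] for pos in self._positions(word))


def build_index(documents):
    """Step one: create one filter per document and add all its words."""
    index = {}
    for doc_id, text in documents.items():
        bf = BloomFilter()
        for word in text.split():
            bf.add(word)
        index[doc_id] = bf
    return index


def search(index, query):
    """Step three: return documents whose filters match every term."""
    terms = query.split()
    return [doc_id for doc_id, bf in index.items()
            if all(bf.might_contain(t) for t in terms)]
```

The serialization step (shipping each fixed-size bit array to the client as a string) is omitted, but since every filter is the same small size, the whole index stays compact no matter how wordy the documents are; the trade-off is an occasional false positive hit.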
Whitney Grace, January 09, 2014
Sponsored by ArnoldIT.com, developer of Augmentext
Google Head of Open Source Opens Up
January 9, 2014
Google and open source have worked together since the search engine’s inception, and that collaboration has contributed to its success. TechRadar hosts an interview with Google’s head of open source, Chris DiBona, about how Google uses open source, how it has shaped the company, and how Google has changed the face of open source: “How Open Source Changed Google-And How Google Changed Open Source.”
DiBona explains that the open source sector of Google started small but rapidly expanded; since he came on board, the open source division has worked on over 3,700 projects. Even though one might assume he manages the open source compliance part of Android alone, he does not. DiBona and his team contribute to the Android operating system by keeping it in compliance and helping keep development at least three years ahead of the current release. Google’s approach to open source changes with each project; Android and Chrome are totally different when it comes to compliance. DiBona spends a large portion of his time keeping different projects in compliance, especially when they are competitive.
He even alludes to the philosophical difference between the two:
“It’s funny because people say ‘Oh, it’s just software, you shouldn’t worry about it’. Or ‘It’s just business, you shouldn’t worry about it’. But what people seem to forget is that software and business are personal. It’s how we get through our day. It’s an important part of our lives so trying to keep things in perspective is really important. Now, you could say ‘Does that make you a sellout Chris?’ But I don’t feel it does because given that the overall actions of the company have been, in my opinion, really strong and on the side of the angels, I think it’s OK for us to have these discussions, especially internally.”
DiBona is proud of what Google contributes to open source, claiming that Go, Chromium, and Android are its best additions. He muses that without open source, the Web and Linux would not be around at all, and then Google would not exist either. DiBona’s interview cements the importance of open source to Google and vice versa. Google is a big company that supports open source development, as always setting a standard for other companies.
Whitney Grace, January 09, 2014
Sponsored by ArnoldIT.com, developer of Augmentext
Datameer Rakes in Funding
January 9, 2014
Right now, Datameer is happily positioned at the intersection of preparation and opportunity, we learn from “Datameer Picks Up $19M to Help Companies Do Analytics Along with Hadoop” at VentureBeat. The use of Hadoop has been soaring, and Datameer is perfectly poised to rise with it. As more companies implement the open-source data processing framework, Datameer is seeing more demand for its help making sense of it all. It doesn’t hurt that the data-analysis firm built its solutions with Hadoop in mind from the start—any IT professional knows that can mean the difference between headache-free implementation and long hours trying to force applications to play well together.
Investors have taken notice of Datameer’s advantages. Writer Jordan Novet relates:
“‘You’re actually seeing Datameer being purchased almost at the same time as Hadoop itself, at the same time as the distribution,’ Ben Fu, a partner at Next World Capital, said in an interview with VentureBeat. Next World led the latest round of funding for the company, bringing its total funding to $36.8 million. Datameer’s large contracts from customers such as British Telecom, Sears, and Visa, also made the company interesting, Fu said….
Next World Capital’s Fu is joining Datameer’s board. Alongside Next World, Kleiner Perkins Caufield & Byers and Redpoint Ventures also joined the round. The new money will provide Datameer with the firepower to sign up new customers, especially in Europe, where Next World has a program to put startups in touch with executives at enterprises from around the continent.”
Novet notes the funding can also allow Datameer to take advantage of further Hadoop advances, as well as respond to competition. Datameer was founded in 2009 by some of the original Hadoop contributors. Headquartered in San Mateo, California, the company also has offices in New York City and in Halle, Germany. In related and possibly helpful news, Datameer is hiring for several positions as of this writing.
Cynthia Murrell, January 09, 2014
Sponsored by ArnoldIT.com, developer of Augmentext
SharePoint’s Top 20 Hits of 2013
January 9, 2014
Last year was a big year for SharePoint, coming off the release of SharePoint 2013. CMSWire devoted a lot of virtual print space to SharePoint, and is now offering a year in review in its article, “CMSWire’s Top 20 Hits of 2013: SharePoint.”
The article begins:
“SharePoint was one of the topics that attracted a lot of interest in the past year — and just as much controversy. It seems everyone has a view on it and how it should be used. However, there were three big subjects that dominated, and make up the lion’s share of our Top 20 this year: 1) SharePoint Online 2) SharePoint and Yammer and 3) SharePoint in Office 365.”
Stephen E. Arnold is a longtime leader and expert in all things search, including enterprise search. His information service, ArnoldIT.com, devotes a lot of attention to SharePoint. He has also found that users are interested in online deployments and social applications for SharePoint. The last year was a busy one for SharePoint, and it will be interesting to see where 2014 goes as the newness of SharePoint 2013 wears off.
Emily Rae Aldridge, January 9, 2014
IBM Wrestling with Watson
January 8, 2014
“IBM Struggles to Turn Watson into Big Business” warrants front-page treatment in the Wall Street Journal. You can find the story in the hard copy of the newspaper on pages A1 and A2. I saw a link to the item online at http://on.wsj.com/1iShfOG, but you may have to pay to read it or chase down a Penguin-friendly instance of the article.
The main point is that IBM targeted $10 billion in Watson revenue by 2023. Watson has generated less than $100 million in revenue, I presume, since the system “won” the Jeopardy game show.
The Wall Street Journal article is interesting because it contains a number of semantic signals, for example:
- The use of the phrase “in a ditch” in reference to a project at the University of Texas M.D. Anderson Cancer Center
- The statement “Watson is having more trouble solving real-life problems”
- The revelation that “Watson doesn’t work with standard hardware”
- An allegedly accurate quote from a client that says “Watson initially took too long to learn”
- The assertion that “IBM reworked Watson’s training regimen”
- The sprinkling of “coulds” and “ifs”
I came away from the story with a sense of déjà vu. I realized that over the last 25 years I have heard similar information about other “smart” search systems. The themes run through time the way a bituminous coal seam threads through the crust of the earth. When one of these seams catches fire, there are few inexpensive and quick ways to put out the fire. Applied to Watson, my hunch is that the cost of getting Watson to generate $10 billion in revenue is going to be a very big number.
The Wall Street Journal story references the need for humans to learn and then to train Watson about the topic. When Watson goes off track, more humans have to correct Watson. I want to point out that training a smart system on a specific corpus of content is tricky. Algorithms can be quite sensitive to small errors in initial settings. Over time, the algorithms do their thing and wander. This translates to humans who have to monitor the smart system to make sure it does not output information in which it has generated confidence scores that are wrong or undifferentiated. The Wall Street Journal nudges this state of affairs in this passage:
In a recent visit, [a Sloan Kettering oncologist] pulled out an iPad and showed a screen from Watson that listed three potential treatments. Watson was less than 32% confident that any of them were [sic] correct.
Then the Wall Street Journal reported that tweaking Watson was tough, saying:
The project initially ran awry because IBM’s engineers and Anderson’s doctors didn’t understand each other.
No surprise, but the fix just adds to the costs of the system. The article revealed:
IBM developers now meet with doctors several times a week.
Why is this Watson write up intriguing to me? There are four reasons:
First, the Wall Street Journal makes clear that dreams about dollars from search and content processing are easy to inflate and tough to deliver. Most search vendors and their stakeholders discover the difference between marketing hyperbole and reality.
Second, the Watson system is essentially dependent on human involvement. The objective of certain types of smart software is to reduce the need for human involvement. Watching Star Trek and Spock is not the same as delivering advanced systems that work and are affordable.
Third, the revenue generated by Watson is actually pretty good. Endeca hit $100 million between 1998 and 2011 when it was acquired by Oracle. Autonomy achieved $800 million between 1996 and 2011 when it was purchased by Hewlett Packard. Watson has been available for a couple of years. The problem is that the goal is, it appears, out of reach even for a company with IBM’s need for a hot new product and the resources to sell almost anything to large organizations.
Fourth, Watson is walking down the same path that STAIRS III, an early IBM search system, followed. IBM embraced open source to help reduce the cost of delivering basic search. Now IBM is finding that the value-adds are more difficult than key word matching and Boolean centric information retrieval. When a company does not learn from its own prior experiences in content processing, the voyage of discovery becomes more risky.
Net net: IBM has its hands full. I am confident that an azure chip consultant and a couple of 20 somethings can fix up Watson in a nonce. But if remediation is not possible, IBM may vie with Hewlett Packard as the pre-eminent example of the perils of the search and content processing business.
Stephen E Arnold, January 8, 2014
Latest PhraseExpress Packed with Updates
January 8, 2014
When searching, chatting, or writing an important email, do you ever wish you had an auto-complete function for phrases? PhraseExpress 10 is the latest iteration of such a tool, and this version has many new features. We get the rundown from betanews’ “PhraseExpress 10 Debuts Phrase Searches, Outlook Add-In, Input Validation.” For example, users can search within a popup window for a desired phrase, even if the source is in a subfolder.
The article also specifies:
“Display performance is up to ten times faster; bread crumb navigation and scroll wheel support improves navigation; an option to highlight key phrases in color ensures they’re easy to spot; and new customization options mean you can tweak the menu font, size, and more.
Formatted text has been extended with the option to add interactive WYSIWYG formats, including input fields, dropdown menus, date pickers and checkboxes. User input can be validated to reduce the chance of errors, and macros now work in formatted phrases, too.
The new PhraseExpress Enterprise edition includes an Outlook add-in which analyses incoming mails, then offers intelligent auto-complete suggestions based on their context. At its simplest this might just automatically greet the sender with the name used to sign off their email, but it can also provide tailored responses to particular keywords (a product name, say).”
Writer Mike Williams goes on to note improvements to the data-import and phrase-creation processes. There is also SQL Server support and simplified licensing. First released in 2002, PhraseExpress is a product of Bartels Media. Launched in 1997, the small company makes its home in Trier, Germany.
Cynthia Murrell, January 08, 2014
Sponsored by ArnoldIT.com, developer of Augmentext
Prediction for IBM January Announcement
January 8, 2014
Folks at the Register have been reading the signs and believe they know what big announcement IBM plans to make at its Infrastructure Matters virtual event on January 14th. “IBM Flashy January Announcement: Wanna Know What’s in It?” predicts the launch of data centers with all-flash memory. That would be one way to combat storage latency. What makes writer Chris Mellor so sure? Several clues led to the prediction.
First, Mellor points to an SEC filing in which Netlist is suing Diablo and SMART Storage for allegedly using its DIMM tech. The filing revealed that IBM is planning to introduce ULLtraDIMM to the market in one of its X-Series servers in January. If the filing is accurate, that would add around a terabyte of flash alongside a server’s main memory, cutting access time in half.
Then there’s a blog post from Woody Hutsell, who found himself working in IBM’s FlashSystems division after Big Blue bought up his former employer, Texas Memory Systems. According to Hutsell’s post, IBM plans to introduce flash arrays to cover both high-end and efficiency markets; the post hints at a connection to the January announcement.
If those leads aren’t enough, a simple look at the speakers IBM has lined up convinced Mellor that his suspicions are correct. He writes:
“Looking at the three featured speakers at the event we see:
*Adalio Sanchez, General Manager, System x
*Alex Yost, VP, IBM PureFlex, System x and Bladecenter
*Michael Kuhn, Vice President and Business Line Executive, Flash Systems.
“With these three speakers highlighted, your Vulture thinks we are going to be told about IBM X-Series servers fitted with ULLtraDIMMs, other servers fitted with FlashSystem PCIe Flash Adapters, and a new line of Flash System arrays that will feature in-line deduplication and other data management functionality. They will be IBM’s response to Pure System’s arrays, EMC’s XtremIo, NetApp’s EF550 and coming FlashRay, and Violin Memory’s 6000 and 3000-series products.
“The Flash Adapters are IBM’s response to Fusion-io and the many other PCIe flash card vendors such as LSI, Micron and Violin.
“The ULLtraDIMM X-Series servers will be an industry-first and give IBM an edge over Cisco, Dell and HP in the server game.”
So, is IBM ready to move into the all-flash realm, or is our “Vulture” on the wrong trail? We will find out soon.
Cynthia Murrell, January 08, 2014
Sponsored by ArnoldIT.com, developer of Augmentext
Link Mischief at Feedly
January 8, 2014
Here is some content excitement of interest to journalists and bloggers everywhere. MakeUseOf informs us that “Feedly Was Stealing Your Content—Here’s the Story, and Their Code.” Apparently, the aggregation site was directing shared links to copies on their own site instead of to original articles, essentially stealing traffic. Writer James Bruce, eager to delve deeper into the code, makes it clear that he is following up on a discovery originally revealed by The Digital Reader.
For example, the article notes that Feedly is now sending links to the proper sites, but by way of JavaScript code instead of in the usual, server-level way. Bruce also noticed that, in its attempt to improve functionality, Feedly was stripping embedded items from content. Advertising, tracking, share buttons, even “donate” buttons—gone.
Bruce writes:
“Not only were Feedly scraping the content from your site, they were then stripping any original social buttons and rewriting the meta-data. This means that when someone subsequently shared the item, they would in fact be sharing the Feedly link and not the original post. Anyone clicking on that link would go straight to Feedly.
So what, you might ask? When a post goes viral, it can be of huge benefit to the site in question — raising page views and ad revenues, and expanding their audience. Feedly was outright stealing that specific benefit away from the site to expand its own user base. The Feedly code included checks for mobile devices that would direct the users to the relevant appstore page.
It wasn’t ‘just making the article easier to view’ — it was stealing traffic, plain and simple. That’s really not cool.”
The write-up goes on to detail the ways Feedly has responded to discoveries, where the issue stands now, and “what we have learnt”: Feedly made some bad choices in the pursuit of a streamlined reading experience. As a parting shot, Bruce cites another example of a bad call by the company—it briefly required a Google+ account to log in. He has a point there.
Cynthia Murrell, January 08, 2014
Sponsored by ArnoldIT.com, developer of Augmentext
SharePoint Consulting Services Ranked
January 8, 2014
As SharePoint deployments get more and more involved and customized, many organizations are turning to SharePoint consultants to help launch or refresh implementations. In light of the trend, PR Web looks at the most successful SharePoint consulting firms in the article, “Ten Top SharePoint Consulting Services Issued in December 2013 by bestwebdesignagencies.com.”
The article says:
“The independent authority on web solutions, bestwebdesignagencies.com, has promoted the best SharePoint consulting firms in the mobile development industry for the month of December 2013 . . . The rankings are produced by the independent research team through painstaking testing and analysis to decide the best firms offering SharePoint consulting solutions. To view the ratings of the top SharePoint development services click here.”
Stephen E. Arnold is a longtime leader in search and frequently covers SharePoint on his information service, ArnoldIT.com. His coverage also points to an increasingly complicated enterprise environment, one that begs for outside expertise and consultation. Users who are in need of such services may find some assistance in the consulting services ranked by bestwebdesignagencies.com.
Emily Rae Aldridge, January 8, 2014
Autonomy: Accusations Fly from the US Air Force
January 7, 2014
The Washington Post story “Government Questioned MicroTech about Its Role in HP Fraud Allegations” puts search and content processing in the spotlight. The newspaper is digging into the interesting underbelly of US government contracting. (The full series is at http://wapo.st/19aZwPh.)
I am certain that there are many fascinating tales about the interactions of contractors, contract officers, politicians, and lobbyists. The Washington Post is hopping into the fray and not a minute too soon to probe activities somewhat less fresh than the Healthcare.gov project or a number of higher profile projects, including tanks that are orphans and fighters that are too slow, underarmed, and unable to outperform fighters from certain other countries.
In fact, I think the HP-Autonomy deal closed a couple of years ago and US government contracting has been chugging along in its present form for 40, 50 years. Perhaps the procurement processes will change so that contractors’ business practices can change accordingly.
I found this passage from the Post story interesting:
MicroTechnologies LLC is among two companies and six executives who are said to have taken part in the efforts to boost the revenues of software maker Autonomy before its sale to HP, according to documents prepared by the Air Force deputy general counsel’s office that raised the possibility of barring all the parties from receiving federal contracts.
The Post story was picked up by other “real” journalists, including the estimable Telegraph in the UK (See the British take in “Autonomy Founder Mike Lynch under Fire from US Air Force over HP Claims.”)
After working through the stories, I formed several hypotheses:
- Resellers bundled software, storage, and hardware for clients. The reason may be a desire to get an “appliance” rather than a box of Lego blocks, or to procure a system without having to go through the uncertain process of getting approval for a capital expenditure.
- The indirect sales model used by Autonomy with considerable success required Autonomy to pay money when the reseller picked up the phone and said, “We sold a big deal, and we need cash to move forward” or some variation on this theme that is well known to integrators and resellers.
- The business process in place provided payments to resellers because of the terms of a particular agreement with a reseller or class or partners. Autonomy purchased some resellers and integrators to respond to the challenges the indirect sales model posed to Autonomy since 1998.
- Some combination of factors was arbitrated by Autonomy’s financial team.
Autonomy purchased the Fast Search & Transfer government sales unit, and that group may have imported some of Fast Search’s procedures.
With Dr. Michael Lynch inventing video technology like US8392450 and US8488011, filed coincident with the HP closing, was he able to dive into reseller deals?
The fact is that Autonomy is now a unit of Hewlett Packard. What few pay attention to is another fact. HP was an Autonomy partner for a number of years prior to its purchase of Autonomy. HP was part of Autonomy’s indirect sales channel and presumably knows how procurements, sequesters, allocated funds, and the other arcana of US government procurement “works.”
Dr. Lynch did something no other search or content processing vendor serving the enterprise market was able to do. From the inception of Autonomy in 1996, he exhibited an uncanny knack for recognizing trends and incorporating solutions to information access problems on top of those trends. In the course of Autonomy’s trajectory from 1996 to 2011, Autonomy grew as a modern-day roll-up that generated almost $850 million in 2011.
I am supportive of a historical understanding of search and content processing. On one hand, Autonomy is now HP’s information processing prodigy. On the other hand, HP may not have the management or technical skills to build on Dr. Lynch’s work.
Oracle paid about $1 billion for Endeca, a system that dates from roughly the same era as Autonomy’s system. But HP paid $11 billion for Autonomy and discovered quickly that surviving and thriving in the odd universe of enterprise search and content processing is tough when the steering wheel is in its hands. Is Dr. Lynch on track when he suggests that his management team was more skilled than some realized?
Investigations into government contracting procedures are quite fascinating. I know from some of my past work that bureaucracies work in mysterious ways.
Perhaps some of these mysteries will be revealed? On the other hand, some of the mysteries may never be known. Where are the Golden Fleece awards today? Do bureaucracies have teeth? Do bureaucracies protect their own? Do special interests exert influence? These are difficult questions.
Maybe there will be answers in 2014? On the other hand, there may be more public relations, content marketing, and spin. I hope those involved with the matter dig into Bayes-Laplace methods, Shannon information theory, and Linear Weight Networks. The methods can help separate noise from high value information.
Stephen E Arnold, January 7, 2014