SwiftRiver: Open Source Pushes into the Intel Space
September 13, 2010
If you are a social netizen, you know it isn’t easy to keep track of, manage, and organize the hundreds of Twitter streams, Facebook updates, blog posts, RSS feeds, or SMS messages that keep arriving. Do not feel helpless: SwiftRiver, a free open source intelligence-gathering platform for managing real-time streams of data, comes to your aid. The platform consists of a number of products and technologies, and its goal is to aggregate information from multiple media channels and add context to it using semantic analysis.
SwiftRiver can also be used as a search tool, for email filtering, to monitor numerous blogs, and to verify real-time data from various channels. It offers “several advanced tools (social graph mining, natural language processing, location servers, and Twitter analytics) for free use via the open API platform Swift Web Services.” According to the parent site, Swiftly.org, “This free tool is especially for organizations who need to sort their data by authority and accuracy, as opposed to popularity.” SwiftRiver can act quickly on massive amounts of data, a feat critical for emergency response groups, election monitors, media, and others.
There are multiple Swift Rivers. You want the one at http://swift.ushahidi.com or http://swiftly.org/.
Ushahidi, the company behind this initiative, claims, “The SwiftRiver platform offers organizations an easy way to combine natural language/artificial intelligence process, data-mining for SMS and Twitter, and verification algorithms for different sources of information.” Elaborating further, it states, “SwiftRiver is unique in that there is no singular ‘SwiftRiver’ application. Rather, there are many that combine plug-ins, APIs, and themes in different ways that are optimized for workflows.”
Presently SwiftRiver uses the Sweeper App, the Kohana MVC UI, the distributed reputation system RiverID, and SwiftWebServices (SWS) as the API platform. The beauty here is that SwiftRiver is just the core; it can have any UI, app, or API on top. It also has an intuitive and customizable dashboard, and “users of WordPress and Drupal can add features like auto-tagging and more using Swift Web Services.” While you may download SwiftRiver and run it on your own web server, SWS is a hosted cloud service and does not need to be downloaded and installed.
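To make the aggregate-and-add-context idea concrete, here is a minimal sketch of that kind of pipeline. It assumes the Python feedparser library and a toy keyword tagger; the feed URLs and tag rules are invented for illustration and are not SwiftRiver’s actual plug-ins or API.

```python
# Minimal sketch of an aggregate-and-add-context pipeline in the SwiftRiver
# spirit: pull items from several channels, then attach crude "context" tags.
# Feed URLs and tag rules are illustrative only, not SwiftRiver's API.
import feedparser  # pip install feedparser

FEEDS = [
    "http://example.com/news/rss",     # hypothetical RSS channel
    "http://example.org/crisis/atom",  # hypothetical Atom channel
]

# Toy "semantic" rules: a real system would use NLP, not keyword matching.
TAG_RULES = {
    "election": ["ballot", "vote", "polling"],
    "emergency": ["earthquake", "flood", "evacuation"],
}

def tag_item(title, summary):
    text = (title + " " + summary).lower()
    return [tag for tag, words in TAG_RULES.items()
            if any(word in text for word in words)]

def aggregate(feeds):
    river = []
    for url in feeds:
        parsed = feedparser.parse(url)
        for entry in parsed.entries:
            river.append({
                "source": url,
                "title": entry.get("title", ""),
                "tags": tag_item(entry.get("title", ""),
                                 entry.get("summary", "")),
            })
    return river

if __name__ == "__main__":
    for item in aggregate(FEEDS):
        print(item["tags"], item["title"])
```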
Harleena Singh, September 13, 2010
Freebie
RSS Readers Dead? And What about the Info Flows?
September 13, 2010
Ask.com is an unlikely service to become a harbinger of change in content. Some folks don’t agree with this statement. For example, read “The Death Of The RSS Reader.” The main idea is that:
There have been predictions since at least 2006, when Pluck shut its RSS reader down that “consumer RSS readers” were a dead market, because, as ReadWriteWeb wrote then, they were “rapidly becoming commodities,” as RSS reading capabilities were integrated into other products like e-mail applications and browsers. And, indeed, a number of consumer-oriented RSS readers, including News Alloy, Rojo, and News Gator, shut down in recent years.
The reason is that users are turning to social services like Facebook and Twitter to keep up with what’s hot, important, newsy, and relevant.
An autumn forest. Death or respite before rebirth?
I don’t dispute that for many folks the sound of the RSS boom has dissipated. However, several factors help me understand why the RSS reader has lost its appeal for most Web users. Our work suggests these factors are at play:
- RSS setup and management cause the same problems that the original Pointcast, Backweb, and Desktop Data created. There is too much for the average user to do up front and then too much ongoing maintenance required to keep the services useful.
- The RSS stream outputs a lot of baloney along with the occasional chunk of sirloin. We have coded our own system to manage information on the topics that interest the goose (a rough sketch of that type of filter appears after this list). Most folks don’t want this type of control. After some experience with RSS, my hunch is that many users find feeds too much work and just abandon them. End users and consumers are not too keen on doing repetitive work that keeps them from kicking back and playing Farmville or keeping track of their friends.
- The volume of information is itself one part of the problem. The high value content moves around, so plugging into a blog today is no guarantee that the content source will be consistent, on topic, or rich with information tomorrow. We have learned that lack of follow-through by content creators is an issue. Publishers know how to make content. Dabblers don’t. The problem is that publishers can’t generate big money, so their enthusiasm seems to come and go. Individuals are just individuals, and a sick child can cause a blog writer to find better uses for any available time.
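As promised above, here is a rough sketch of the kind of topic filter we have in mind. The feedparser library, watch list, weights, and threshold are all illustrative; this is not our production system.

```python
# Rough sketch of a topic filter for RSS items: score each entry against a
# watch list and keep only the sirloin. The watch list, weights, and
# threshold are illustrative, not a production configuration.
import feedparser  # pip install feedparser

WATCH_LIST = {"enterprise search": 3, "taxonomy": 2, "semantic": 1}

def score(entry):
    text = (entry.get("title", "") + " " + entry.get("summary", "")).lower()
    return sum(weight for term, weight in WATCH_LIST.items() if term in text)

def keepers(feed_url, threshold=2):
    feed = feedparser.parse(feed_url)
    return [entry.get("title", "") for entry in feed.entries
            if score(entry) >= threshold]

# Example (hypothetical feed URL):
# print(keepers("http://example.com/search-news/rss"))
```

The point of the sketch is the maintenance problem described above: someone has to pick the terms, set the weights, and keep revising both as sources drift.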
Instant Search: Who Is on First?
September 12, 2010
A reader sent me a link to “Back to the Future: Innovation Is Alive in Search.” The point of the write up is to make clear that for the author of the post, Yahoo was an innovator way back in 2005. In fact, if I understand the Yahooligan’s blog post, Yahoo “invented” instant search. I am an addled goose, but I recall seeing that function in a demo given me in 1999 or 2000 by a Fast Search & Transfer technology whiz. Doesn’t matter.
Search has lacked innovation for a long, long time. In fact, if you can track down someone who will share the closely guarded TREC results, you will see that precision and recall scores remain an interesting challenge for developers of information retrieval systems. In fact, the reason social curation seems to be “good enough” is that traditional search systems used to suck, still suck, and will continue to suck.
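For readers who have not chased down TREC numbers, precision and recall are simple to define and stubbornly hard to raise together. A minimal illustration with made-up figures:

```python
# Precision and recall for a single query, with made-up numbers.
# retrieved: what the engine returned; relevant: what actually answers the query.
retrieved = {"doc1", "doc2", "doc3", "doc4", "doc5"}
relevant = {"doc2", "doc5", "doc9", "doc12"}

true_positives = retrieved & relevant
precision = len(true_positives) / len(retrieved)  # 2 / 5 = 0.40
recall = len(true_positives) / len(relevant)      # 2 / 4 = 0.50

print(f"precision={precision:.2f} recall={recall:.2f}")
```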
The problem is not the math, the wizards, and the hybrid Rube Goldberg machines that vendors use to work their magic. Nope. The problem with search has several parts. Let me make them explicit because the English majors who populate the azure chip consulting firms and “real” blogs have done their best to treat technology as a John Donne poem:
First, language. Search involves language, which is a moving target. There’s a reason why poems and secret messages are tough to figure out. Words can be used in ways that allow some to “get it” and others to get an “F” in English 301: The Victorian Novel. At this time, software does better at certain types of language than others. One example is medical lingo. There’s a reason why lots of vendors have nifty STM (scientific, technical, and medical) demos.
Second, humans. Humans usually don’t know exactly what they want. Humans can recognize something that is sort of what they want. If the “something” is close enough for horseshoes, a human can take different fragments and glue them together to create an answer. These skills baffle software systems. The reason social curation works for finding information is that the people in a “circle” may be closer to the mind set of the person looking for information. Even if the social circle is clueless, the placebo effect kicks in and justifies the “good enough” method; that is, use what’s available and “make it work”, just like Project Runway contestants.
Third, smart software. Algorithms and numerical recipes, programmable search engines, fuzzy logic, and the rest of the PhD outputs are quite useful. The problem is that humans who design systems, by definition, are not yet able to create a system that can cope with the oddities that emerge from humans being human. So as nifty as Google is at finding a pizza joint in Alphabet City, Google and other systems may be wildly wrong as humans just go about their lives being unpredictable, idiosyncratic, irrational, and incorrect in terms of the system output.
I think there is innovation in search. Now my definition of innovation is very different from the Yahooligan’s. I am not interested in pointing out that most commercial and open source search systems just keep doing the basics. Hey, these folks went to college and studied the same general subjects. The variants are mostly tweaks to methods others know pretty well. After getting a PhD and going into debt, do you think a search engineer is going to change direction and “invent” new methods? I find that most search engineers are like bowling balls rolling down a gutter. The ball gets to the end of the lane, but the pins are still standing. Don’t believe me? Run a query on Ask.com, Bing.com, Google.com, or any other search system you can tap. How different are the results?
The challenge becomes identifying companies and innovators who have framed a findability problem in such a way that the traditional problems of search become less of an issue. Where does one look for that information? Not in blog posts from established companies whose track record in search is quite clear.
Stephen E Arnold, September 12, 2010
Freebie
Guha Still Going Strong: Spam Prevention in the PSE
September 7, 2010
I still think Ramanathan Guha is a pretty sharp Googler. I met a university professor who did not agree. Tough patooties. Here is the most recent patent application from the guru Guha: US20100223250, “Detecting Spam Related and Biased Contexts for Programmable Search Engines.”
A programmable search engine system is programmable by a variety of different entities, such as client devices and vertical content sites to customize search results for users. Context files store instructions for controlling the operations of the programmable search engine. The context files are processed by various context processors, which use the instructions therein to provide various pre-processing, post-processing, and search engine control operations. Spam related and biased contexts and search results are identified using offline and query time processing stages, and the context files from vertical content providers associated with such spam and biased contexts and results are excluded from processing on direct user queries.
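Reduced to a sketch, the offline and query-time split described in the abstract amounts to something like the code below. The data structures, spam scores, and re-ranking rule are my guesses for illustration; they are not Guha’s implementation.

```python
# Guessed-at sketch of the patent's idea: an offline stage flags context files
# from vertical content providers as spam-related or biased; at query time
# those context files are excluded before they can reshape results.
# All structures and the scoring rule are illustrative, not Google's code.

CONTEXT_FILES = [
    {"id": "travel-site", "boost_terms": ["cheap tickets"], "spam_score": 0.9},
    {"id": "medical-site", "boost_terms": ["clinical trial"], "spam_score": 0.1},
]

def offline_pass(context_files, threshold=0.8):
    """Offline processing stage: decide which context files to exclude."""
    return {cf["id"] for cf in context_files if cf["spam_score"] >= threshold}

def query_time(results, context_files, excluded):
    """Query-time stage: apply only trusted context files to re-rank results."""
    for cf in context_files:
        if cf["id"] in excluded:
            continue  # spam-related or biased context: ignored
        for result in results:
            if any(term in result["text"] for term in cf["boost_terms"]):
                result["score"] += 1.0
    return sorted(results, key=lambda r: r["score"], reverse=True)

excluded = offline_pass(CONTEXT_FILES)
results = [{"text": "new clinical trial results", "score": 1.0},
           {"text": "cheap tickets to anywhere", "score": 1.0}]
print(query_time(results, CONTEXT_FILES, excluded))
```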
What’s the significance? You will have to wait for one of the azurini to explain Guhaisms. I would note these clues:
- Context
- Entities
- Query time processing stages
But what does an addled goose know? Not much.
Stephen E Arnold, September 7, 2010
Freebie
Twitter: New Monetizing Play?
August 14, 2010
Data and text mining boffins like to crunch “big data.” The idea is that the more data one has, the less slop in the wonky “scores” that fancy math slaps on certain “objects.” Individuals think their actions are unique. Not exactly. The more data one has about people, the easier it is to create some conceptual pig pens and push individuals into them. If you don’t know the name and address of the people, no matter. Once a pig pen has enough piggies in it (50 is a minimum I like to use as a lower boundary), I can push anonymous “users” into those pig pens. Once in a pig pen, the piggies do some predictable things. Since I am from farm country, I know piggies will move toward chow. You get the idea.
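A minimal sketch of the pig pen idea, assuming scikit-learn’s KMeans and a made-up behavior matrix; the features, cluster count, and the 50-piggy floor are illustrative, not anyone’s production model.

```python
# Minimal sketch of "pig pens": cluster anonymous users by behavior and keep
# only pens with at least 50 members. Features and data are invented;
# scikit-learn's KMeans stands in for whatever a real shop would use.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Each row is one anonymous user: [tweets/day, links clicked, retweets, active hour]
behavior = rng.random((1000, 4))

kmeans = KMeans(n_clusters=8, n_init=10, random_state=42).fit(behavior)

MIN_PEN_SIZE = 50
for pen in range(kmeans.n_clusters):
    members = int(np.sum(kmeans.labels_ == pen))
    if members >= MIN_PEN_SIZE:
        centroid = kmeans.cluster_centers_[pen].round(2)
        print(f"pen {pen}: {members} piggies, typical behavior {centroid}")
```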
When I read “Twitter Search History Dwindling, Now at Four Days”, I said to myself, “Twitter can charge for more data.” Who knows if I am right, but if I worked at Twitter, I could think of some interesting outfits that might be interested in paying for deep Twitter history. Who would want “deep Twitter history”? Good question. I have written about some outfits, and I have done some interviews in Search Wizards Speak and the Beyond Search interviews that shed some light on these folks.
What can a data or text miner do with four days’ data? Learn that he / she needs a heck of a lot more to do some not-so-fuzzy mathy stuff.
Stephen E Arnold, August 14, 2010
Freebie.
Cloud and Context: Fuzzy and Fuzzier
August 11, 2010
I got a kick out of “Gartner Says Relevancy of Search Results Can be Improved by Combining Cloud and Context-Aware Services.” Fire up your iPad and check out this write up, which has more big ideas and insights than Leonardo, Einstein, or Andy Rooney ever had. You will want to read the full text of the article. What I want to do is list the memes that dot the write up like chocolate chips in Toll House cookies. Here goes:
- value
- cloud-based services
- context-based services
- revenue facing external search installation
- informational services
- integration engineers
- contextual information
- value from understanding
- Web search efforts
- market dynamics
- general inclination
- search in the cloud
- discoverable information
- offloading
- quantifiable improvements
- social networking
- user’s explicit statement of interests
- rich profile data
Cool word choice, right? Concrete. Specific. Substantive. Now here’s the sentence that I was tempted to flag as a quote to note. I decided to include it in this blog post:
Optimizing search through the effective application of context is a particularly helpful and effective way to deliver valuable improvements in results sets under any circumstances.
Got that? Any circumstance. Well, except being understandable to an addled goose.
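Stripped of the buzzwords, “applying context” might mean something as plain as the sketch below: re-rank a result set with a user’s explicit statement of interests and a bit of rich profile data. The profile, results, and weights are my invention, not anything the Gartner piece spells out.

```python
# One concrete reading of "context-aware search": re-rank a result set using a
# user's explicit interests and location. Profile, results, and weights are
# invented for illustration only.

profile = {"interests": {"enterprise search": 2.0, "cloud": 1.0},
           "location": "Louisville"}

results = [
    {"title": "Cloud pricing update", "base": 0.6},
    {"title": "Enterprise search shootout in Louisville", "base": 0.5},
    {"title": "Celebrity gossip roundup", "base": 0.7},
]

def contextual_score(result, profile):
    score = result["base"]
    title = result["title"].lower()
    for interest, weight in profile["interests"].items():
        if interest in title:
            score += weight
    if profile["location"].lower() in title:
        score += 0.5  # crude "context-based service": boost local items
    return score

for r in sorted(results, key=lambda r: contextual_score(r, profile), reverse=True):
    print(round(contextual_score(r, profile), 2), r["title"])
```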
Stephen E Arnold, August 11, 2010
Freebie
Six Semantic Vendors Profiled
August 9, 2010
I saw in my newsreader this story: “Introducing Six Semantic Technology Vendors: Strengthening Insurance Business Initiatives with Semantic Technologies.” The write up is a table of contents or a flier for a report prepared by one of the azurini with a focus on what seems to be “life and non life insurance companies.”
For me the most interesting snippet in the advertisement was this sequence, which I have formatted to make it more readable.
Attivio offers a common access platform combining unstructured and structured content [Note: one of Attivio’s founders has left the building. No details.]
Cambridge Semantics wants to help companies quickly obtain practical results [Note: more of a business intelligence type solution.]
Lexalytics has a ‘laser-focus’ on sentiment analysis. [Note: lots of search and content processing in a Microsoft centric wrapper.]
Linguamatics finds the nuggets hidden in plain sight. [Note: the real deal with a core competency in pharmaceuticals which I suppose is similar to life and non life insurance companies.]
MetaCarta identifies location references in unstructured documents in real-time. [Note: a geo tagging centric system now chased by outfits like MarkLogic, Microsoft, and lots of others]
SchemaLogic enables information to be found and shared more effectively using semantic technologies. [Note: I thought this outfit managed metatags across an enterprise. At one time, the company was focused on Microsoft technology. Today? I don’t know because when one of the founders cut out, my interest tapered off.]
The list and its accompanying prose are interesting to me for three reasons:
First, the descriptions of these firms as semantic do not map to my impression of the six firms’ technologies. I am okay with the inclusion of Cambridge Semantics and Linguamatics, but I am not in sync with the azurini who plopped the other four outfits in the list. I think I can dredge up an argument to include these four firms on a content processing list, but gung-ho semantic technology? Nope.
Second, the link pointed me to a reseller of market research. The hitch in the git along for me was that the landing page did not point to the report. When I ran a query for “semantic technology vendors” I saw this message: “Sorry, no reports matching your search were found. For personal search assistance, please send us a request at contact@aarkstore.com.”
Third, the source of the report did not jump off the page at me. In short, what the heck is this document? How much does it cost? How can anyone buy it if the vendor’s search system doesn’t work and the write up on the Moso-technology.com Web site is fragmented?
I can’t recommend buying or not buying the report. Too bad.
Stephen E Arnold, August 9, 2010
Minority Report and Reality: The Google and In-Q-Tel Play
August 9, 2010
Unlike the fiction of the film “Minority Report”, predictive analytics are here and now. More surprising to me is that most people don’t realize that the methods are in the category of “been there, done that.”
I don’t want to provide too much detail about predictive methods applied to military and law enforcement. Let me remind you, gentle reader, that using numerical recipes to figure out what is likely to happen is an old, old discipline. Keep in mind that the links in this post may go dead at any time, particularly the link to the Chinese write up.
There are companies that have been grinding away in this field for a long time. I worked at an outfit that had a “pretzel factory”. We did not make snacks; we made predictions along with some other goodies.
In this blog I have mentioned over time companies that operate in this sector; for example, Kroll (recently acquired by Altegrity) and Fetch Technologies. Now that’s a household name in Sioux City and Seattle. I have even mentioned a project on which I worked, which you can ping at www.tosig.com. Other hints and clues are scattered about like Johnny Appleseed’s wacky trees. I don’t plan on pulling these threads together in a free blog post.
© RecordedFuture, 2010. Source: http://www.analysisintelligence.com/
I can direct your attention to the public announcement that RecordedFuture has received some financial Tiger Milk from In-Q-Tel, the investment arm of one of the US government entities. Good old Google via its ventures arm has added some cinnamon to the predictive analytics smoothie. You can get an acceptable run down in Wired’s “Exclusive: Google, CIA Invest in ‘Future’ of Web Monitoring.” I think you want to have your “real journalist” baloney detector on because In-Q-Tel invested in RecordedFuture in January 2010, a fact disclosed on the In-Q-Tel Web site many moons ago. RecordedFuture also has a Web site at www.recordedfuture.com, rich with marketing mumbo jumbo, a video, and some semi-useful examples of what the company does. I will leave the public Web site to readers with some time to burn. If you want to get an attention deficit disorder injection, here you go:
The Web contains a vast amount of unstructured information. Web users access specific content of interest with a variety of Websites supporting unstructured search. The unstructured search approaches clearly provide tremendous value but are unable to address a variety of classes of search. RecordedFuture is aggregating a variety of Web-based news and information sources and developing semantic context enabling more structured classes of search. In this presentation, we present initial methods for accessing and analyzing this structured content. The RJSONIO package is used to form queries and manage response data. Analytic approaches for the extracted content include normalization and regression approaches. R-based visualization approaches are complemented with data presentation capabilities of Spotfire.
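The excerpt above describes an R workflow built on RJSONIO. For readers who want the gist without the R, here is the same pattern sketched in Python: pull structured event data as JSON, then fit a simple trend line to event counts. The endpoint, parameters, and response shape are invented; RecordedFuture’s actual API is not documented here.

```python
# Sketch of the workflow the excerpt describes: fetch structured event data
# from a JSON API, then fit a trend line to event counts over time.
# The endpoint and response shape are hypothetical, for illustration only.
import json
import urllib.parse
import urllib.request

import numpy as np

def fetch_events(query):
    url = "https://api.example.com/events?q=" + urllib.parse.quote(query)  # hypothetical endpoint
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)  # assumed shape: {"days": [...], "counts": [...]}

def trend(days, counts):
    # Ordinary least squares fit: counts ~ slope * day + intercept
    slope, intercept = np.polyfit(days, counts, deg=1)
    return slope, intercept

if __name__ == "__main__":
    data = fetch_events("protest AND capital")
    slope, _ = trend(np.array(data["days"]), np.array(data["counts"]))
    print(f"events trending at {slope:.2f} per day")
```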
Taxodiary: At Last a Taxonomy News Service
August 3, 2010
I have tried to write about taxonomies, ontologies, and controlled term lists. I will be the first to admit that my approach has been to comment on the faux pundits, the so-called experts, and the azurini (self-appointed experts in metatagging and indexing). The problem with the existing content flowing through the datasphere is that it is uninformed.
What makes commentary about tagging informed? Three attributes. First, I expect those who write about taxonomies to have built commercially successful systems to manage term lists, and I expect those term lists to be in wide use and to conform to standards from ISO, ANSI, and similar outfits. Second, I expect those running the company to have broad experience in tagging for serious subjects, not the baloney that smacks of search engine optimization and snookering humans and algorithms with alleged cleverness. Third, I expect the systems used to build taxonomies, classification schemes, and term lists to work; that is, a user can figure out how to get information out of the system relevant to his or her query.
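To make the third attribute concrete: at its simplest, a controlled term list maps the variants users type to the preferred terms the indexers actually applied, so a query on any variant reaches the same documents. A toy sketch with an invented vocabulary:

```python
# Toy controlled vocabulary: variant terms map to one preferred term, so a
# query on any variant retrieves documents indexed under the preferred form.
# Terms and document ids are invented for illustration.

THESAURUS = {
    "car": "automobiles",
    "cars": "automobiles",
    "autos": "automobiles",
    "automobile": "automobiles",
}

INDEX = {  # preferred term -> document ids
    "automobiles": ["doc-14", "doc-87"],
}

def lookup(user_term):
    preferred = THESAURUS.get(user_term.lower(), user_term.lower())
    return preferred, INDEX.get(preferred, [])

print(lookup("Cars"))  # ('automobiles', ['doc-14', 'doc-87'])
```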
Splash page for the Taxodiary news and information service.
How rare are these attributes?
Darned rare. When I worked on ABI/INFORM, Business Dateline, and the other database products, I relied on two people to guide my team and me. The first person was Betty Eddison, one of the leaders in indexing. May she rest in indexing heaven, where SEO is confined to Hell. Betty was one of the founders of InMagic, a company on whose board I served for several years. Top notch. Care to argue? Get ready for a rumble, gentle reader.
The second person was Margie Hlava. Now Ms. Hlava, like Ms. Eddison, is one of the top guns in indexing. In fact, I would assert that on my yardstick she holds the top spot in this discipline. Please keep in mind that her company Access Innovations and her partner Dr. Jay ven Eman are included in my reference to Ms. Hlava. How good is Ms. Hlava? Very good, saith the goose.
IBM OmniFind 9.1: Trouble for Some Search Partners?
August 2, 2010
IBM has embraced open source. Now before you wade through the links for the new IBM OmniFind 9.1 search system, let me own up to a previous error. I did not believe that IBM would do much to make open source search a key part of the firm’s software strategy. I was wrong. IBM did, or people like Mike McCandless did. Second, the decision to use Lucene and wrap IBM’s product strategy and pricing around it pretty much means that some of IBM’s favored enterprise search vendors are going to find themselves sitting home when IBM makes certain sales calls. Third, the IBM pricing strategy does not mean that enterprise search IBM-style is free. The idea is that IBM will be able to chase after Microsoft without the legacy of the $1.3 billion investment in Fast Search & Transfer, the legal and police muddle, and the mind-boggling task of converting Fast into the broader vista of SharePoint. (Do you think my reference to “vista” evokes the Windows 7 predecessor? Silly you.)
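For readers who have never touched the open source core IBM is now wrapping: at heart, Lucene builds an inverted index over documents and runs queries against it. Here is a minimal sketch of that workflow using the pure-Python Whoosh library as a stand-in; it only illustrates the Lucene-style index-and-query pattern and is not what IBM ships inside OmniFind.

```python
# Minimal stand-in for what an embedded open source search core does:
# define a schema, index a few documents, run a query. Whoosh is used here
# purely to illustrate the Lucene-style workflow; OmniFind's internals differ.
import tempfile
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

schema = Schema(path=ID(stored=True), content=TEXT)
index_dir = tempfile.mkdtemp()  # the on-disk index directory
ix = create_in(index_dir, schema)

writer = ix.writer()
writer.add_document(path=u"/memo/1", content=u"OmniFind pricing for connectors")
writer.add_document(path=u"/memo/2", content=u"Lotus Notes email migration plan")
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("connectors")
    for hit in searcher.search(query):
        print(hit["path"])  # -> /memo/1
```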
Here’s what we have based on my poking around.
You get to license connectors. These puppies will be saddled with IBM pricing elements. This means that it will be tough for a customer to compare what he/she paid with what another customer paid. Bad for competitors too, but that’s a secondary issue compared to generating revenue. Run a query for part number BFG04CML. The adapters work with the UIMA standard.
You get to pay for the multi language option. Same pricing deal as connectors.
There is an email search component, which is available as IBM OmniFind Personal E-Mail Search, or IOPES. This works with Lotus Notes and Microsoft Outlook. IBM sales engineers may be able to bundle up the bits and pieces needed to stop outfits like the not well known Isys Search Software outfit from Australia from selling search to a Lotus Notes customer.
The security model reminds me of Oracle’s SES11g approach. You get a system and then get to buy components. Same pricing model again.
You can license a classification model. Same pricing mechanism.
If you already have an OmniFind search installation, you have to reindex after working through the update procedure. That sequence is too complex for a blog post, and if anyone wants a summary, I charge for it. The darned method was not particularly easy to locate on the IBM Web site. Sorry, I run a business.
You can still handle collections, but you have to set these up via the administrative interface or the configuration files.
If you have a bunch of IBM servers running OmniFind, you have to update each one in the search system. Have fun.
There is a Web crawler available, and I think our test showed that it called itself UFOcrawler.
For more information about OmniFind 9.1 click this link. Be patient. The new color is green, which evokes the cost of the add-ons and components. Nevertheless, this is bad news for some commercial search and content processing vendors accustomed to IBM’s throwing them bones. IBM is now eating those bones in my opinion. The sauce is open source. Tasty too.
Stephen E Arnold, July 30, 2010