September 29, 2014
Navigate to “Postgres Full Text Search Is Good Enough.” I first heard this argument at a German information technology conference a few years ago. The idea is surprisingly easy to understand. As long as a user can bang in a couple of keywords, scan a result list, and locate information that the user finds helpful, the job is done. The search results may consist of flawed or manipulated information. The search results may be off point for the user’s query when evaluated by old fashioned methods such as precision and recall. The user may be dumb and rely on whatever the user finds accurate.
This write up explains the good enough approach in terms of PostgreSQL, a useful open source Codd type data management system. Please, note. I am not uncomfortable with good enough search. I understand that when the herd stampedes, it is not particularly easy to stop the run. Prudence suggests that one take cover.
Here’s the guts of the write up:
What do I mean by ‘good enough’? I mean a search engine with the following features:
- Ranking / Boost
- Support Multiple languages
- Fuzzy search for misspelling
- Accent support
Luckily PostgreSQL supports all these features.
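To make the feature list concrete, here is a toy sketch in plain Python of three of the four items: ranking with a boost, fuzzy matching for misspellings, and accent folding. It is not the write up's code and it does not touch PostgreSQL, which handles these jobs with facilities such as `ts_rank`, the `pg_trgm` extension, and the `unaccent` extension. The function names, the `title_boost` weight, and the `cutoff` value are assumptions made up for illustration, and multi-language support is left out entirely.

```python
import difflib
import re
import unicodedata


def fold(text):
    """Lowercase and strip accents so 'Café' and 'cafe' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokens(text):
    """Split folded text into word tokens."""
    return re.findall(r"\w+", fold(text))


def search(query, docs, title_boost=2.0, cutoff=0.8):
    """Return (doc_id, score) pairs, best first.

    docs maps doc_id -> (title, body). A hit in the title counts
    title_boost points, a hit in the body counts 1 point.
    """
    results = []
    for doc_id, (title, body) in docs.items():
        title_words, body_words = tokens(title), tokens(body)
        score = 0.0
        for term in tokens(query):
            # Fuzzy match: a near miss such as 'databse' still counts.
            if difflib.get_close_matches(term, title_words, n=1, cutoff=cutoff):
                score += title_boost
            if difflib.get_close_matches(term, body_words, n=1, cutoff=cutoff):
                score += 1.0
        if score > 0:
            results.append((doc_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample documents, invented for this sketch.
docs = {
    1: ("Café culture in Paris", "Where to drink coffee"),
    2: ("Database notes", "A cafe near the office has wifi"),
    3: ("Gardening", "Tomatoes and basil"),
}
```

Calling `search("cafe", docs)` ranks the accented title above the body-only mention, and `search("databse", docs)` still finds the database document despite the typo. A real deployment would lean on the database's own indexing rather than scanning every document the way this sketch does.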
The write up contains some useful code snippets to make use of search features. The discussion of full text search is coherent and addresses a vast swath of content. Note that proprietary vendors have tilled acres of marketing earth and fertilizer to convert search into a mind boggling range of functions.
This article includes code snippets to tackle full text within PostgreSQL.
Querying is included as well. Again, code snippets are included. (My teenage advisors said, “Very useful snippets.” Okay. Good.)
The write up concludes:
We have seen how to build a decent multi-language search engine based on a non-trivial document. This article is only an overview but it should give you enough background and examples to get you started with your own….Postgres is not as advanced as ElasticSearch and SOLR but these two are dedicated full-text search tools whereas full-text search is only a feature of PostgreSQL and a pretty good one
Reasonable observation. Worth reading.
If you are a vendor of proprietary search technology, there will be more individuals infused with the spirit of open source, not fewer. How many experts are there for proprietary systems? Fewer than the cadres of open source volk I surmise.
Stephen E Arnold, September 29, 2014
September 25, 2014
MarkLogic, founded more than a decade ago, is an interesting company. I heard that Google kicked its tires because Christopher Lindblad is a true wizard.
The outfit offers an Extensible Markup Language data management solution. Over the years, the company has positioned the system to slice and dice content for publishers, intelligence analysis for government entities, and enterprise search. Along the way, the company’s technology has been shaped to meet the needs of the pivoting forces in content processing. Stated another way, when one thing won’t sell at a pace to keep investors happy, try another way. In the course of its journey, the company brushed against Oracle and then found itself snarled in the confusion between JSON and XML and the sort of open proprietary extensions to the query language used to extract results from the XML store only to get buffeted by the hoo hah about Hadoop and assorted open source alternatives to Codd databases. Wow.
I read a content marketing / public relations story called “MarkLogic Expands Global Reach with New Offices in Chicago.” Check the source quickly because some BusinessWire content can disappear or become available to those who fork over dough to the “news” service. The write up asserted:
“The opening of these new offices is well-timed for the growing number of global customers who need the enterprise grade NoSQL solutions we are delivering to US-based customers,” said David Ponzini, senior vice president of corporate development, MarkLogic. “We are in an advantageous position to make an immediate impact in Europe and Southeast Asia. We continue broadening the market awareness for MarkLogic throughout the world.”
The trick, of course, will be to blast through the financial goals for the company set by the investors years ago. A failure to produce more than $60 million in revenues several years ago led to the departure of one president. A couple of more senior executives have spun through the revolving door not too far from Google Island with its quirky dinosaur skeleton. Does that skeleton stand as a metaphor for proprietary software solutions?
In my view, the business thinking at work is more sales offices equals more sales. I once had an office in Manhattan even though I worked in Illinois. The cost was about $20 per month. I had an address on Park Avenue, south unfortunately and a 212 phone number. I made a sale or two to an organization run by John Suhler, but I quickly figured out that the key to making sales was my being in and around midtown.
I thought I read that outfits like IBM are going to a “no office” approach. Maybe MarkLogic has identified a solution to the overhead associated with full time equivalents and physical space? That begs another question, “What does MarkLogic know that IBM does not know?”
Some vendors have found that more sales offices increase costs without generating sufficient revenue to cover the overhead, miscellaneous costs and in country marketing expenses. I can name several Paris, France based content processing companies who learned first hand that additional offices are a very, very expensive proposition. Other companies leverage partners for revenues. In one of my industry reports, I pointed out that prior to the sale of Autonomy to HP, Autonomy figured out a hybrid sales model that seemed to work as long as Dr. Lynch was cracking the whip. Remove the management, the partnering model can go off the rails.
Don’t get me wrong. XML is a wonderful solution to certain types of information challenges. Thomson Reuters can produce hundreds of for fee publications using XQuery and XSLT with proprietary extensions. A quick look at Thomson Reuters financial results suggests that more may be needed by this company than a foundation and an XML data store.
How quickly will MarkLogic deliver a five or ten X return on the $70 million investors have pumped in? In today’s market, cranking out $300 to $700 million in revenues from content processing technology that competes with open source alternatives is a tall order.
Maybe more sales offices will do it? My hunch is that more closed deals are the evidence some stakeholders seek.
Stephen E Arnold, September 25, 2014
August 27, 2014
In Greek mythology, Cassandra had the gift of prophecy, but she was also cursed so that no one believed her. The NoSQL database of the same name shared a similar problem when it first started, but unlike the tragic heroine it has since grown to be a popular and profitable bit of code. Wired discusses Cassandra’s history and current endeavors in “Out In the Open: The Abandoned Facebook Tech That Now Helps Power Apple.”
Cassandra began at Facebook, which built it to better scale information across machines and open sourced it in 2008. Jonathan Ellis co-founded DataStax around the technology. Cassandra faded into the background for a while, but DataStax continued to gain traction with its proprietary software. Apple has since joined the Cassandra community and is its second largest contributor. DataStax, however, will not acknowledge that Apple is one of its clients.
The article points out that a single database product cannot reign supreme in 2014’s market. New ways to house and utilize data will continue to grow, much of it driven by open source. What does that mean for DataStax and Cassandra?
“Ellis says the strategy for Cassandra and DataStax will be ensuring that its technology can work with any new technology that can come along. For example, DataStax recently released a connector for Spark that will enable developers to easily use Spark to analyze data stored in Cassandra. ‘We’re trying to be the database that drives our application, not necessarily the analytics,’ he says. ‘There’s nothing that marries us to one of those platforms.’”
From reading this, it seems the big data push has quieted down somewhat, but companies based on open source software are trying to create products that let people use their data more intelligently and without the holdups of earlier big data pushes. One thing is for sure: if DataStax truly does have Apple as a client, it can kiss success on the mouth.
August 13, 2014
The explanatory article on MacLochlainns Weblog titled Hiding a Java Source offers information for those interested in concealing a Java source in an Oracle database. It is a relatively brief article that consists of straightforward instructions. The article begins,
“The ability to deploy Java inside the Oracle database led somebody to conclude that the source isn’t visible in the data catalog. Then, that person found that they were wrong because the Java source is visible when you use a DDL command to CREATE, REPLACE, and COMPILE the Java source. This post discloses how to find the Java source and how to prevent it from being stored in the data catalog.”
The article concludes with instructions on how to ascertain that the Java source is compiled outside the database. Obviously, this article is only intended for white hat reasons, right? Michael McLaughlin, the author of the blog, has a long history with Oracle, going back to Oracle 6. He has written several handbooks on Oracle and teaches database technology at BYU-Idaho. The blog used to be focused solely on Oracle as well, but now offers posts on a range of topics from Java to Mac OS to Microsoft Excel and more.
Chelsea Kerwin, August 13, 2014
July 4, 2014
The purported father of NoSQL, Norman T. Kutemperor, made an appearance at this year’s Enterprise Search & Discovery conference, we learn from “Scientel Presented Advanced Big Data Content Management & Search With NoSQL DB at Enterprise Search Summit in NY on May 13” at IT Business Net. The press release states:
“Norman T. Kutemperor, President/CEO of Scientel, presented on Scientel’s Enterprise Content Management & Search System (ECMS) capabilities using Scientel’s Gensonix NoSQL DB on May 13 at the Enterprise Search & Discovery 2014 conference in NY. Mr. Kutemperor, who has been termed the ‘Father of NoSQL,’ was quoted as saying, ‘When it comes to Big Data, advanced content management and extremely efficient searchability and discovery are key to gaining a competitive edge.’ The presentation focused on: The Power of Content – More power in a NoSQL environment.”
According to the write-up, Kutemperor spoke about the growing need to manage multiple types of unstructured data within a scalable system, noting that users now expect drag-and-drop functionality. He also asserted that any NoSQL system should automatically extract text and build an index that can be searched by both keywords and sentences. Of course, no discussion of databases would be complete without a note about the importance of security, and Kutemperor emphasized that point as well.
The veteran info-tech company Scientel has been in business since 1977. These days, they focus on NoSQL database design; however, it should be noted that they also design and produce optimized, high-end servers to go with their enterprise Gensonix platform. The company makes its home in Bingham Farms, Michigan.
Cynthia Murrell, July 04, 2014
June 6, 2014
It is a situation we have all faced. We are watching our favorite program, and then suddenly a song starts to play in the background. As the song emphasizes the action on screen, we have trouble identifying it. A smartphone might not be handy with a song recognition app and by the time it is downloaded the song is over. What do you do then? Beyond the obvious of rewinding (if you have that option), be glad that the Internet has a solution. LifeHacker tells us that “TuneFind Tells You What Songs Are In TV Episodes And Movies.”
There is now an entertainment database for everything online. TuneFind lets you browse and search to find that song stuck in your head.
“TuneFind’s library is pretty extensive for both TV shows and movies. You can browse by shows, movies, and artists, but you can also browse by what’s popular. It’s pretty cool to see what other users have been searching for the most over the last week, month, and year. For TV shows, the selection goes back a ways, but nothing from the early 90s and earlier seems to be present. I’m probably wrong, but the earliest I could find was 1999′s excellent Freaks and Geeks. For movies the reach back is about the same.”
TuneFind works the same as other online databases and the content is extensive considering it goes back to 1999. If you have ever wondered about something an actor wore on TV, you will also enjoy WornOnTV. Does anybody sense the next wave of advertisement and MTV?
May 14, 2014
I read “Europe’s Top Court: People Have Right to Be Forgotten on Internet.” Fascinating. The real news article said, “People can ask Google to delete sensitive information from its Internet search results.” The source of the assertion was Europe’s top court. After I read the item, I wondered what was being “deleted” and “from where”? When it comes to removing content, the concept of deletion may need some of Mr. Bill Clinton’s “is” type thinking. Content can disappear. An example would be information from government servers. In some cases, the removal of content is intentional. In others, a system administrator performs an operation and – poof – content is history.
Digital information is like “dark matter.” It may be hard to detect, but some people know that it is very real. For example, poke around the Internet Archive Wayback Machine. There is some interesting information on that system that may be otherwise difficult, if not impossible, to access.
Then there is the problem of deleting content from data management systems. I am confident that Europe’s top court knows that removing an item from an index does not remove the item from the data management system, back ups, or mirrors of content residing “out there” on the Internet or on a researcher’s personal computer.
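A minimal sketch, assuming a toy system with a document store, a search index, and a list of backups, shows the gap between delisting and deleting. The class and method names are hypothetical and stand in for no real search engine or data management product.

```python
class TinySearchSystem:
    """Toy model: a store, an index over the store, and backups."""

    def __init__(self):
        self.store = {}    # doc_id -> full text (the data management system)
        self.index = {}    # term -> set of doc_ids (what the search box consults)
        self.backups = []  # snapshots of the store

    def add(self, doc_id, text):
        self.store[doc_id] = text
        for term in text.lower().split():
            self.index.setdefault(term, set()).add(doc_id)

    def backup(self):
        # Copy the store so later changes do not touch the snapshot.
        self.backups.append(dict(self.store))

    def search(self, term):
        return sorted(self.index.get(term.lower(), set()))

    def delist(self, doc_id):
        """Remove a document from the index only; the store is untouched."""
        for postings in self.index.values():
            postings.discard(doc_id)


system = TinySearchSystem()
system.add(1, "old court record")
system.backup()
system.delist(1)
```

After `delist(1)`, a query for "court" comes back empty, yet the text still sits in `system.store` and in `system.backups[0]`. Anyone with access to the store or a backup can still read it.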
The notion of deleting is fuzzy to me.
Almost as fascinating is the question of who gets to “remove” what? What are the procedures for getting content deleted from Google or any other system? How does one know that the information is gone? Run a query on a free Web search engine? A commercial system?
Like many ideas in the category “barn burned and horses gone”, deleting content from the “Internet” may be a challenging issue to resolve. In the case of removing content from some of the major online search systems, a Costco has already been erected on the site where the barn once stood, horses grazed, and sun touched information farmers once raised their data crops.
Stephen E Arnold, May 15, 2014
May 13, 2014
It is time for people to understand that relational databases were not made to handle big data. There is just too much data jogging around in servers and mainframes and the terabytes run circles around relational database frameworks. It is sort of like a smart fox toying with a dim hunter. It is time that more robust and reliable software was used, like Hadoop. GCN says that there are “5 Ways Agencies Can Use Hadoop.”
Hadoop is an open source programming framework that spreads data across server clusters. It is faster and less expensive than proprietary software. The federal government is always searching for ways to slash costs and if they turn to Hadoop they might save a bit in tech costs.
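The idea of spreading data across machines and processing each slice where it sits can be sketched in a few lines of plain Python. This is a single-process simulation of the map, shuffle, and reduce steps associated with Hadoop, not Hadoop's own API. The function names and the round-robin partitioning are assumptions chosen for illustration.

```python
from collections import defaultdict


def partition(records, node_count):
    """Spread records round-robin across simulated nodes."""
    nodes = [[] for _ in range(node_count)]
    for i, record in enumerate(records):
        nodes[i % node_count].append(record)
    return nodes


def map_words(record):
    """Map step: emit a (word, 1) pair for every word in one record."""
    return [(word, 1) for word in record.lower().split()]


def word_count(records, node_count=3):
    nodes = partition(records, node_count)
    # Map: each simulated node processes only its own slice of the data.
    mapped = [
        [pair for record in node for pair in map_words(record)]
        for node in nodes
    ]
    # Shuffle: gather the counts for each word from every node.
    grouped = defaultdict(list)
    for node_output in mapped:
        for word, count in node_output:
            grouped[word].append(count)
    # Reduce: sum the counts per word.
    return {word: sum(counts) for word, counts in grouped.items()}
```

Running `word_count(["fraud alert", "fraud report", "security alert"], node_count=2)` tallies each word across both simulated nodes. In a real cluster the map step would run in parallel on separate machines, which is where the speed and cost claims come from.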
“It is estimated that half the world’s data will be processed by Hadoop within five years. Hadoop-based solutions are already successfully being used to serve citizens with critical information faster than ever before in areas such as scientific research, law enforcement, defense and intelligence, fraud detection and computer security. This is a step in the right direction, but the framework can be better leveraged.”
The five ways the government can use Hadoop are to store and analyze unstructured and semi-structured data, improve initial discovery and exploration, make all data available for analysis, provide a staging area for data warehouses and analytic data stores, and lower costs for data storage.
So can someone explain why this has not been done yet?
May 12, 2014
InfoWorld reports there is going to be a halt in progress for Oracle in the article, “Beware Of NoSQL Standards In Oracle’s Clothing.” Industry standards help regulate and control information technology. They can even help push IT forward, but according to some anonymous sources Oracle is trying to make NoSQL startups sign up for a standards body in order to slow down change.
Just the very idea of this happening is sickening for the open source community:
“In reality, big vendors use standards to halt their larger customers from adopting new technology or create weird new-old hybrids to keep the old ways alive. There are many companies that, once they see a standardization effort, will wait for the BigCo-supported standard to be adopted before they upgrade their tech stack. Since such adoption tends to be slow anyhow, this is an effective delaying tactic. Meanwhile, the big vendor works to control the standards body.”
Oracle wants to slow down progress, because it eats into their profit margin. Oracle wants the future to come at a pace it chooses, where it will control the market, get patented technology under FRAND terms, and buy up NoSQL vendors.
Standardization is a good thing, but Oracle needs to realize that relational databases are too small to handle the amount of big data in systems. It’s a call to arms for the open source community to fight relying on outdated technology. Echoes of keeping video rental stores over streaming services are in this.
May 9, 2014
In what they are calling its “biggest release ever,” the updated open source MongoDB 2.6 boasts even more features than before. Application Development Trends describes the improvements in, “MongoDB Releases Major Upgrade to NoSQL Database.” MongoDB Inc. has done the math, and says MongoDB is now the leading NoSQL database. The company also has high hopes for the future.
The article describes one concept key to the new version:
“The improved query engine features a new index intersection that will fulfill queries that are supported by more than one index. Also, index filters will limit the indexes that can ‘become the winning plan for a query.’ Developers using the database can now use the count method in conjunction with the hint method. You can learn more about that here.”
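The notion of index intersection can be illustrated with a rough sketch. This is plain Python, not MongoDB's query engine or its driver API. Each single-field index maps a value to the set of matching document ids, and a query on two fields intersects the two sets instead of scanning every document. The helper names and the sample documents are hypothetical.

```python
def build_index(docs, field):
    """Single-field index: field value -> set of document ids."""
    index = {}
    for doc_id, doc in docs.items():
        index.setdefault(doc.get(field), set()).add(doc_id)
    return index


def find(docs, indexes, criteria):
    """Answer an equality query by intersecting one id set per indexed field."""
    candidate_ids = None
    for field, value in criteria.items():
        ids = indexes[field].get(value, set())
        candidate_ids = ids if candidate_ids is None else candidate_ids & ids
    return [docs[doc_id] for doc_id in sorted(candidate_ids or [])]


# Hypothetical sample documents, invented for this sketch.
docs = {
    1: {"status": "open", "owner": "ann"},
    2: {"status": "open", "owner": "bob"},
    3: {"status": "closed", "owner": "ann"},
}
indexes = {
    "status": build_index(docs, "status"),
    "owner": build_index(docs, "owner"),
}
```

A call such as `find(docs, indexes, {"status": "open", "owner": "ann"})` consults both indexes and returns only the document that appears in both id sets. Whether a real planner picks intersection over a single index depends on its cost estimates, which this sketch ignores.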
Writer David Ramel turns his attention to security:
“Security improvements include better SSL support, x.509 authentication, an enhanced authorization system that features more granular controls, centralized storage of credentials and better tools for user management. The new version also features TLS encryption, along with user-defined roles, auditing functionality and field-level redaction, which Horowitz described as ‘a critical building block for trusted systems.’ The database auditing feature is extended by the new capability to integrate with IBM InfoSphere Guardium.”
MongoDB CTO and co-founder Eliot Horowitz reports that his team has re-written the query execution engine for better scalability. The upgrade also includes an easier-to-maintain codebase, the ability to return result sets of any size, and improved support for bulk operations. Horowitz notes that this version includes the groundwork for improvements planned for version 2.8, like document-level locking. See the articles for more improvements and details.
The company behind the open source MongoDB database, MongoDB Inc., makes its money on related management services. Launched in 2007, the company has offices throughout North America, Europe, and the Asia-Pacific region.
Cynthia Murrell, May 09, 2014