An interview with Sergei Ananyan
Bloomington, Indiana, was the last place I expected to find a text mining and analytics company. Megaputer, according to company president Sergei Ananyan, chose this sleepy college town for "many reasons". He did not elaborate, but if an outfit with close ties to Moscow State University and a love of complex algorithms wants to live in the midst of the hustle and bustle of a Big 10 university with ready access to faculty, Bloomington is a good choice. I live in the middle of nowhere in Kentucky, so on my drive to Bloomington, I had time to sample the charms of rural Indiana and its rain-soaked landscape.
Megaputer is one of the firms that came on my radar when a colleague at a pharmaceutical company mentioned the firm to me several years ago. The firm provides quite a bit of descriptive information on its Web site. The highlights are that Megaputer's text mining technology makes it possible for a licensee to process, cluster, search, and categorize large collections of documents. The company sells a desktop version designed for an individual analyst as well as an on-premises enterprise version of the system. Taxonomies are quite the rage in organizations. Megaputer's system can automatically generate hierarchical taxonomies for document categorization. You can create and manage domain-specific semantic dictionaries, instruct the system to summarize documents, identify similar documents, create visualizations of correlated terms, and generate reports on dashboards for busy executives.
Megaputer has a branch office in Buffalo, New York and R&D centers located in Moscow, Russia and Cheboxary, Russia.
We are sitting in a coffee shop at the edge of the campus. Mr. Ananyan is a personable but reserved man. He enjoys talking about the Megaputer technology and found my clever references to the new film You Don't Mess with the Zohan slightly off topic. The full text of my conversation with Sergei Ananyan appears below:
What's a "Megaputer"? Where did the name come from?
Megaputer Intelligence develops advanced analytic tools for data and text mining, predictive modeling, multi-dimensional analysis and reporting.
The name “Megaputer Intelligence” designates fast and efficient extraction of knowledge from large volumes of data. The word Megaputer was the code name of a joint German-Russian research project aimed at building a neuro-processor back in the late 1980s. The name Megaputer implied that the neuro-processor would be much more powerful than ordinary computer processors due to its more advanced architecture. The project was terminated, but the idea of developing powerful computational systems based on superior analytic algorithms survived.
Note that the name of our company consists of two words. The word “Intelligence” illustrates that we empower customers to convert raw data to valuable knowledge. The word “Megaputer” illustrates that we can extract knowledge from data coming from virtually any source, and that our knowledge extraction techniques are scalable and efficient. The original data can be structured or textual, it can come from any database, statistical system, collection of documents, emails, or from the Internet. We help customers extract patterns of interest from textual data and create models capable of predicting outcomes of future situations.
Why are you in Bloomington, Indiana? Part of your team is in Moscow, right?
Megaputer Intelligence is a multi-national company with key operations in the USA and Russia.
Indeed, Megaputer Intelligence grew out of a research project in artificial intelligence conducted in Moscow, Russia at the beginning of the 1990s. The first Megaputer office in Moscow was established in 1994. Now Megaputer Intelligence has its headquarters in Bloomington, and we have offices in a number of cities.
Megaputer Intelligence Inc. was founded in Bloomington, IN as a result of my connection with Indiana University. I am one of the co-founders of the company. I am a native of Russia and I have a Ph.D. in nuclear physics. In 1997, I came to the USA to work on a contract with Indiana University. The same year we started the company Megaputer Intelligence in Bloomington. A few years later this office assumed the role of the headquarters because the majority of our customers were located in the USA. Now the R&D work and sales are carried out both in the USA and Russia. We maintain a relationship with Indiana University: many employees of Megaputer are IU graduates. As the company was growing, we launched a second R&D Center in Cheboxary, Russia in 2005 and a sales office in Buffalo, NY in 2006. I think we have a presence in about 15 countries now.
Where did the idea to create a data and content processing system originate?
Intelligent decisions provide a new competitive edge. They are based on knowledge one can extract from quickly growing amounts of raw data. With the increasing complexity of the organization of human society, with technical progress in data storage and computing power, and with the surge of electronic communications, the amount of stored data is growing exponentially. The better and faster one can extract knowledge from huge amounts of raw data, the stronger the competitive edge one can achieve. It became obvious that the human brain is unable to cope with the quickly growing amounts of data requiring intelligent processing. The only solution capable of bridging the growing gap between data and decisions would be the development of powerful machine learning algorithms.
Since we were students at Moscow State University in the 1980s, we were fascinated by the fields of artificial intelligence, machine learning and linguistics. This passion grew into a research project that resulted in developing the first version of the future leading data mining system PolyAnalyst™ and establishing Megaputer Intelligence as a commercial company conducting sales and further development of advanced analytic systems. Through years of relentless R&D work and numerous analytic projects carried out for large customers, the Megaputer team developed a scalable, comprehensive and tightly integrated analytic platform enabling customers to easily create custom data analysis solutions.
Having observed the great need for the analysis of growing volumes of natural language text documents and the daunting challenges people encountered in tackling this task, Megaputer brought to the market one of the first text mining tools, TextAnalyst, in 1998. Building on the success of this system, Megaputer incorporated more advanced text mining tools for clustering, categorization, and summarization of documents, as well as entity extraction and taxonomy generation, in its flagship platform PolyAnalyst. Now PolyAnalyst for Text™ helps customers solve standard tasks ranging from the analysis of survey and call center data, to processing safety reports in healthcare and transportation, and to fraud detection and subrogation prediction in insurance. PolyAnalyst is being used by over 25 of the Fortune 100 companies, 9 US Federal Government agencies, and over 500 customers overall.
What differentiates your approach from the systems available from SAS and SPSS?
That's a good question. Let me come at it this way.
Megaputer does not compete directly with SAS or SPSS. When Megaputer Intelligence was launched in 1997, both SAS and SPSS were well established providers of statistical tools. For a newcomer, the best strategy was to introduce new analytic technologies and find new applications of existing technologies.
Megaputer enjoys being at the forefront of research. For example, Megaputer was the first company to provide powerful text mining capabilities in the framework of a data mining platform: we released PolyAnalyst for Text in 2002. This proved to be the right move and several years later, both SPSS and SAS purchased independent text mining vendors to improve their text analysis offerings.
Can you be more specific? What are some of the key differentiators?
Sure. Stop me if I don't give you enough detail.
First, we have made a significant effort to address ease of use of the system. Being overwhelmed by the complexity of the problem, the last thing customers want to deal with is the complexity of the analytic tool itself. We make data analysis simple and customers love this. Megaputer has a joke slogan of “Developing data mining systems for kindergartners”.
Second, we are sticklers for accuracy. The use of innovative approaches to data analysis enables Megaputer to deliver more accurate results. In a recent head-to-head comparison with one of the leading data mining tools carried out by a large high-tech company, PolyAnalyst consistently demonstrated about 20-25 percent better accuracy in classifying computer repair call center notes than the competing system. As a result, the customer selected Megaputer as its supplier of analytic tools.
The company is a well-known one, but I am not able to reveal the details due to client confidentiality.
That's okay. What's point three?
Third is our algorithms. Maintaining its own strong team of applied mathematicians, Megaputer keeps providing new analytic algorithms not available in other data mining tools. Our recent additions included Bayesian Networks (including a dynamic version), Fraud Detection (MediCop™), and Time Series Analysis algorithms.
So you emphasize math in the way that Google does?
We value math, and I suppose we share that technical foundation. So, okay, we are good at math just like Google with one difference. I think we are specialists in the type of math necessary to make Megaputer solve our clients' problems.
Okay, now the fourth differentiator. We have a comprehensive set of analytic tools. To efficiently address the data analysis tasks they are facing, many customers need to apply in one analysis a combination of data manipulation, text mining, predictive modeling, and report generating techniques. PolyAnalyst offers a comprehensive toolkit of all these capabilities on a single platform. Data manipulation and predictive modeling techniques work hand in hand with OLAP (multi-dimensional cubes) and a built-in reporting engine.
And, point five is that we have a built-in reporting engine. For us, extracting knowledge from data and building a model is only one part of the story. It might be of critical importance to deliver these models and knowledge to the actual decision makers. We say that PolyAnalyst is the first comprehensive data mining system featuring its own flexible reporting tools for generating easy to comprehend custom reports for business users.
Megaputer keeps developing PolyAnalyst as a powerful and flexible analytic platform, but our real strength derives from the ability to build push-button custom solutions for handling typical tasks in various application domains.
When you go to a doctor with a severe headache, you do not care what devices or medications the doctor will use as long as the treatment is efficient, painless, timely, easy to follow, and affordable.
You just want the headache to be gone, so that you can return to your normal life.
When customers turn to Megaputer, they are very much in the same situation: they have an overwhelming data analysis headache and they need to find a solution. They do not come to us out of their love for technology and in many cases they do not care what precisely we do in order to solve their problem. They just need an accurate, timely, easy to operate, and affordable solution. And this is what Megaputer strives to offer them in the form of domain specific custom analytic solutions. This is yet another important differentiator setting us apart from other data analysis vendors concentrating primarily on building tools.
When I looked at your system, I concluded that you had put quite a bit of functionality in your single user system. Are the customers able to deal with the options without getting confused or calling you for customer support?
Yes, you are correct.
PolyAnalyst offers quite a bit of data loading, manipulation, analysis and reporting capabilities. Our objective was to develop a powerful and scalable, yet easy to use analytical platform enabling users to quickly build diverse custom solutions. We have created a number of push-button solutions based on PolyAnalyst for different application domains including survey analysis, safety reports analysis, and fraud detection.
While providing users of PolyAnalyst with lots of functionality, we try to lower the learning curve for new users. We spend lots of thought and effort on keeping PolyAnalyst as simple in use as possible. We make every effort to simplify the user experience with the system. The user builds a data analysis scenario through an intuitive drag-and-drop interface. The developed scenario is represented as a graphical flow chart with editable nodes and can be shared for collaboration or scheduled as a task for future execution. The results of any analytical steps can be saved in an easy-to-comprehend and visually appealing report the user generates on the fly.
What types of users of your system have you identified?
There are two types of users of PolyAnalyst: Data Analysts and Decision Makers. Data Analysts seek flexibility and power in an analytical system. They can quickly find all tools they need to create a solution addressing their data analysis task and delivering the results of the analysis in a simple format to business users. Data analysts can efficiently master PolyAnalyst during a two-day user training seminar Megaputer offers to beginners.
Decision makers have no interest in learning a new analytical tool. They have a very limited time window for reviewing the results of the analysis and making an informed decision based on what they have learned. For this group of users, PolyAnalyst provides a simplified interface that features collections of interactive up-to-date reports summarizing key results of the analysis performed by Data Analysts. Decision Makers need to spend very little time, if any, learning PolyAnalyst before they start using it.
While PolyAnalyst provides “data mining for kindergartners”, our “kindergartners” typically have a college degree. Now and then they find themselves calling Megaputer Support with technical questions. In these cases, we strive to provide them with quick and knowledgeable answers. In 2006 Megaputer received a DM Review Magazine Readership award for the best customer service (as rated by readers of the magazine).
What functionality does your server based solution provide?
Now you are going to test my memory. Let me say that PolyAnalyst Server is a powerful, scalable and user friendly platform that supports the complete data analysis process. PolyAnalyst incorporates tools that can provide many services. I will now try to identify each of these.
A user can connect to external sources and load data from any standard database, statistical, and spreadsheet system. Megaputer can load a collection of documents from a file system, collect RSS feeds and email letters, or spider a Web site.
We also provide a federation function. This means that our system can integrate data from disparate sources.
As part of our federating capability, we handle numerous data manipulation, transformation and aggregation operations.
One feature we include is data cleansing (including Fragment Analysis and intelligent spell checking).
I have already mentioned our interest in machine learning. So, we use a broad range of algorithms. The user does not have to deal with these directly, and we work to refine our systems and methods.
Our system can carry out linguistic and semantic analysis of natural language text.
We provide graphical tools to assist the user in importing external knowledge bases or building her own hierarchical thesauri. With Megaputer, you do not have to license a separate utility to deal with word lists.
The system can integrate the result of analysis of textual and structured data.
It is easy for an analyst to perform graphical visualization of results. The system can create multi-dimensional cubes (OLAP).
One of the important features is that we provide graphical tools to make it easy to generate easy to comprehend custom reports summarizing results of an analysis. You can also set up and deliver condition-based e-mail alerts.
We provide a point-and-click interface to make it easy to schedule multi-step tasks for execution at a given time.
Did you cover the main features?
Well, I can go into more detail on each of these functions if you like.
No, let's shift focus slightly. SAS Institute recently purchased Teragram presumably to add to the Inxight Software functions. Does this change the game in text analytics?
No. We think we offer a highly desirable system. In many ways, we have robust text processing functions that put us at the forefront of this type of analysis.
What text processing functions do you offer?
Megaputer offers a broad selection of text mining algorithms. As a pioneer in creating text mining algorithms, Megaputer developed its own platform for linguistic and text mining analysis. We did not have to purchase third party text analysis technologies like other leading data mining vendors were forced to do to catch up with the market.
Megaputer incorporated various text mining algorithms in our flagship analytic system PolyAnalyst to facilitate tight integration of the results of text analysis with structured data analysis, visualization and reporting capabilities of the system.
Will you run down some of the functions you offer?
Glad to. PolyAnalyst can support language detection and intelligent spell checking.
Key word extraction, document clustering, pattern detection, and interactive visualization are supported as well.
Our approach to document classification is very robust. For example, we support both automated learning on pre-categorized examples (Support Vector Machine and Naive Bayesian algorithms) and categorization based on user-defined patterns in Pattern Definition Language.
I mentioned our taxonomy creation function and entity extraction, I believe. We also can perform negation detection.
What's negation detection?
It is a technique for discovering that the text implies the opposite of the normal meaning of a term. For example, the sentence "I had never encountered any problem with the hard drive" implies that in fact the hard drive was just fine, despite the presence of the terms "problem" and "hard drive" found in the same sentence.
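To make the idea concrete, here is a minimal Python sketch of window-based negation detection. It is not Megaputer's method: the cue list `NEGATION_CUES`, the `WINDOW` size, and the function names are assumptions chosen purely for illustration, and a production system would rely on real linguistic analysis rather than a fixed token window.

```python
import re

# Hypothetical cue list and window size, chosen for illustration only.
NEGATION_CUES = {"no", "not", "never", "without", "none", "cannot"}
WINDOW = 5  # number of tokens after a cue that are treated as negated


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def is_negated(sentence, term):
    """Return True if `term` appears within WINDOW tokens after a negation cue."""
    tokens = tokenize(sentence)
    for i, token in enumerate(tokens):
        if token in NEGATION_CUES or token.endswith("n't"):
            if term in tokens[i + 1 : i + 1 + WINDOW]:
                return True
    return False
```

Applied to the sentence quoted above, `is_negated("I had never encountered any problem with the hard drive", "problem")` is true because "problem" falls inside the window that follows "never", so a classifier could avoid counting the record as a complaint.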
Okay, thanks. What else?
We perform sentiment analysis; that is, determine if a collection of documents is positive or negative on a topic.
I already touched upon our support for creating, importing, and managing hierarchical thesauri. We support creating, importing and managing hierarchical semantic thesauri as well.
We perform text OLAP, which makes certain types of analyses particularly useful.
One last point: PolyAnalyst enables you to search for patterns expressed in a Pattern Definition Language.
Can you describe how a typical customer uses your system to process, analyze, and output reports that combine both structured data and unstructured data? In general, what's a typical customer do with your system?
We assume that the customer already has a collection of data that requires analysis. The data can reside in a database, statistical or spreadsheet system, be available as a collection of documents in a file system in popular formats, or be accessible over the Internet. To derive knowledge from raw data, a typical customer would first decide whether it is possible to use one of the standard or previously built PolyAnalyst data analysis scenarios, or whether she has to develop a new data analysis flowchart. If the user finds a suitable data analysis scenario to run, all that is left to do is to point it to the new data source, tweak the parameters of the scenario a bit, click the Execute button and then read off the results of the analysis.
If the user needs to develop a new data analysis scenario to meet specific objectives, then she would perform the following sequence of steps.
- Log in to the system using a secure authentication mechanism
- Establish a data source connection for loading the data
- Perform the necessary data cleansing, aggregation and manipulation operations
- Run exploratory analysis to familiarize herself with the data
- Extract a subset of data for model training and verification purposes
- Either train a model on text record examples with known outcomes or use PDL to create a rule-based classification model
- Pass the results of text analysis to the input of machine learning techniques for further analysis and modeling
- Apply the developed model to data that requires scoring
- Visualize results, build multidimensional cubes and custom reports for business users
- Export results of the analysis to an external system used by the customer
- Schedule the developed data analysis scenario to run periodically on newly available data
Keep in mind that for a business user, the process of communicating with PolyAnalyst is much simpler. Upon logging in to the system, the business user just needs to select a report of interest from a list of reports she is authorized to view. The user interacts with the report by changing report parameters, for example the date range, and performs drill downs to original data supporting the findings to verify their validity and view statistics of other associated attributes. Now the business user is ready to make informed data-driven decisions.
I know that you can display processed data in a number of report or visual styles. What do you do to make it easy to create queries and then display the results of those queries in a visual display?
The simplicity and user-friendliness of the offered analytic tools is of paramount importance for Megaputer.
Imagine that a car manufacturer required its customers to have detailed knowledge of the structure of the engine and carburetor in order to drive the car. The majority of drivers would find this overwhelming and would have to seek alternative means of transportation. Confining the car operation devices to just two pedals and a steering wheel makes driving available to all potential customers. The same approach holds true for data analysis.
We spent a lot of time trying to simplify the experience the user of PolyAnalyst would have. We did this prior to building the actual system. Among the numerous features simplifying the use of PolyAnalyst, let me mention a few of key importance.
Yes, that would be helpful to me.
First, PolyAnalyst requires no knowledge of SQL language, advanced statistics, or other mathematical or IT skills. The user communicates with PolyAnalyst through simple point-and-click operations in an intuitive graphical interface. The data analyst builds a multi-step data analysis scenario in the form of a visual flowchart clearly conveying the flow of data through a sequence of actions. This flowchart can be saved in the form of a PolyAnalyst project, shared for collaboration with other analysts, and scheduled as a task for execution at a given time.
Second, PolyAnalyst caches outputs of all actions performed on data in the flowchart, simplifying the process of adding these outputs to summary reports for business users. To create a custom report on results, the data analyst readily drags and drops the necessary results of the analysis to the report canvas and selects the desired layout and parameterization of these results.
Third, the business user views easy-to-comprehend summaries of results presented in the form of visually appealing graphical reports. She is completely shielded from any complexities of the performed analysis and can concentrate on the decision making process.
What’s the range of math "under the hood" of your system?
Megaputer takes pride in providing one of the broadest selections of mathematical algorithms for data and text analysis. PolyAnalyst furnishes both advanced rule-based procedures and a variety of adaptive self-learning algorithms.
The user of PolyAnalyst can create text search and categorization queries expressed in terms of operators available in a powerful Pattern Definition Language (PDL). PolyAnalyst taxonomy based categorization with categories defined through PDL expressions is a very accurate classification engine. PDL enables users to perform searches for even quite subtle patterns by utilizing various relations between terms:
- Logical. AND, OR, NOT, and XOR.
- Proximity. Determines how many terms, sentences, or paragraphs apart the two terms should occur.
- Ordering. Determines in what order the terms of interest are encountered in a document.
- Linguistic. Performs stemming and finds different morphological modifications of the term.
- Semantic. Capitalizes on relationships between terms (synonyms, kind of, part of, etc.) captured in imported or user-defined thesauri.
- Phonetic. Finds terms that sound similar but do not have the same spelling.
- Negations. Performs negation detection.
- Sentiment. Carries out sentiment analysis.
PDL operators can be embedded in each other, allowing the user to search for precise signatures of patterns of interest. In combination with domain-specific hierarchical semantic thesauri supported by PolyAnalyst, the use of PDL turns the classification engine into a powerful knowledge discovery tool.
In addition, PolyAnalyst offers a variety of adaptive machine learning algorithms for text and structured data analysis. For example, it can build a model for classifying text records to specified categories by self-learning on a collection of pre-categorized examples. This model building process can be powered either by a Support Vector Machine or by a Naive Bayesian algorithm.
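The interview does not show actual PDL syntax, so the following Python sketch only illustrates how logical, proximity, and ordering relations can be nested into one pattern. The helper names `near`, `before`, and `matches`, and the example terms, are hypothetical and are not part of PDL or PolyAnalyst.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def positions(tokens, term):
    """Return every index at which `term` occurs."""
    return [i for i, token in enumerate(tokens) if token == term]


def near(tokens, a, b, max_gap):
    """Proximity: True if positions of `a` and `b` differ by at most `max_gap`."""
    return any(
        abs(i - j) <= max_gap
        for i in positions(tokens, a)
        for j in positions(tokens, b)
    )


def before(tokens, a, b):
    """Ordering: True if some occurrence of `a` precedes some occurrence of `b`."""
    pos_a, pos_b = positions(tokens, a), positions(tokens, b)
    return bool(pos_a and pos_b) and min(pos_a) < max(pos_b)


def matches(text):
    """Nested pattern: 'drive' near 'failed', 'drive' first, and NOT 'replaced'."""
    tokens = tokenize(text)
    return (
        near(tokens, "drive", "failed", max_gap=3)
        and before(tokens, "drive", "failed")
        and "replaced" not in tokens
    )
```

Under these assumptions, `matches("The hard drive failed after two days")` holds, while a note such as "The failed drive was replaced" is rejected by both the ordering and the NOT condition. Linguistic, semantic, and phonetic relations would need stemming, a thesaurus, and a sound-alike encoding, which this sketch leaves out.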
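As a rough illustration of self-learning on pre-categorized examples, here is a minimal multinomial Naive Bayes classifier written with the Python standard library only. It is a textbook sketch, not PolyAnalyst's engine: the training notes in `EXAMPLES`, the category names, and the function names are all made up for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count words and documents per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in examples:
        doc_counts[category] += 1
        word_counts[category].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(model, text):
    """Return the category with the highest log-probability for `text`."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)  # class prior
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[category] = score
    return max(scores, key=scores.get)


# Made-up call center notes, for illustration only.
EXAMPLES = [
    ("hard drive makes clicking noise and fails to boot", "hardware"),
    ("screen is cracked and the keyboard is missing keys", "hardware"),
    ("cannot log in after the password reset", "account"),
    ("account is locked and the password is rejected", "account"),
]
MODEL = train(EXAMPLES)
```

With this toy model, `classify(MODEL, "the drive fails to boot")` picks the hardware category and `classify(MODEL, "password is rejected")` picks the account category. A real deployment would train on far more records and hold out a subset for verification, as the workflow described earlier suggests.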
What about predictive modeling?
For predictive modeling, PolyAnalyst offers algorithms for solving tasks of clustering, categorization, numerical prediction, pattern and anomaly detection, affinity grouping, dimension reduction, and time series analysis. For addressing these tasks PolyAnalyst offers analytical engines ranging from Neural Networks and Decision Trees, to R-Forest, Bayesian Networks, CHAID and Evolutionary Programming algorithms.
If a customer wants to customize or adjust thresholds for certain functions, how much freedom do you give your licensees? If the licensee wants to customize or integrate your tools into other applications, how does the licensee do this?
While the majority of PolyAnalyst algorithms produce meaningful results when running with default settings out-of-the-box, the user can use advanced settings tabs to adjust thresholds for different algorithms.
Users interested in integrating different capabilities of PolyAnalyst in other applications can use a library of PolyAnalyst modules provided by Megaputer. The integration can be carried out through a J2EE API.
As you look at the search and content processing landscape, what are the three big trends that are evident to you? How is your firm adapting to these trends?
I see the following main trends in content analysis. Customers are seeking analytical systems capable of performing several important functions. Specifically, you can build easy-to-use custom solutions developed to solve specific problems. You can perform joint analysis of text and structured data. You can do analysis of text in multiple languages and similar advanced operations.
Megaputer already offers various means that enable customers to follow the first two trends. We plan to support the analysis of documents in multiple languages by the end of 2008.
I know you cannot reveal any company secrets, but what are the types of enhancements that licensees can expect in the next release of your systems and software?
The next release of PolyAnalyst will contain several important enhancements. I can give you a flavor of several important features. First, we are strengthening our scalability using parallel processing and a 64-bit implementation of the system. We will be introducing a browser-based reporting system. We will be supporting text analysis in multiple languages.
What's the timing of the new version?
We plan to announce the new PolyAnalyst upgrade in the fourth quarter of 2008.
Megaputer is an interesting company with an impressive product. Our experience with the system revealed that it was easy to use and allowed interactive exploration of data in structured and unstructured formats. We particularly like the broad range of built-in features which eliminated the need to use third-party tools to perform certain data normalization tasks. If you are looking for a business intelligence system, be sure to include a test drive of Megaputer's software.
Stephen E. Arnold
June 23, 2008