Big Data and Search Solving Massive Language Processing Headaches
December 4, 2017
Written language can be a massive headache for anyone who needs strong search. Working across multiple languages complicates things further when you need to harness a massive amount of data. Thankfully, natural language processing is the answer, as software architect Federico Tomassetti wrote in his essay, “A Guide to Natural Language Processing.”
According to the story:
…the relationship between elements can be used to understand the importance of each individual element. TextRank actually uses a more complex formula than the original PageRank algorithm, because a link can be only present or not, while textual connections might be partially present. For instance, you might calculate that two sentences containing different words with the same stem (e.g., cat and cats both have cat as their stem) are only partially related.
The original paper describes a generic approach, rather than a specific method. In fact, it also describes two applications: keyword extraction and summarization. The key differences are:
- the units you choose as a foundation of the relationship
- the way you calculate the connection and its strength
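To make the quoted idea concrete, here is a minimal sketch of a TextRank-style ranking in which sentences are the units and shared word stems set the connection strength. It is an illustration under stated assumptions, not the essay's code: the names crude_stem, similarity, and textrank, the toy suffix-stripping stemmer, the damping factor of 0.85, and the fixed iteration count are all hypothetical choices.

```python
import math
import re


def crude_stem(word):
    # Toy stemmer (assumption): strip a trailing "s" so "cats" -> "cat".
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def stems(sentence):
    # The set of stems found in one sentence.
    return {crude_stem(w) for w in re.findall(r"[A-Za-z]+", sentence)}


def similarity(a, b):
    # Edge weight: number of shared stems, normalized by sentence lengths.
    # Unlike a hyperlink, this can be "partially present" (any value >= 0).
    sa, sb = stems(a), stems(b)
    if len(sa) < 2 or len(sb) < 2:
        return 0.0
    return len(sa & sb) / (math.log(len(sa)) + math.log(len(sb)))


def textrank(sentences, damping=0.85, iterations=50):
    # Weighted PageRank-style iteration over the sentence graph.
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


# Hypothetical input: two related sentences and one unrelated sentence.
example = [
    "The cat sat on the mat.",
    "Cats like a warm mat.",
    "Stock prices fell sharply today.",
]
for score, sentence in sorted(zip(textrank(example), example), reverse=True):
    print(round(score, 3), sentence)
```

Swapping the units (words instead of sentences) and the similarity function is what turns the same loop from summarization into keyword extraction.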
Natural language processing is a tricky concept to wrap your head around, but it is becoming something people can no longer ignore. Currently, millions of dollars are being funneled into perfecting the technology. Those who lead the pack here will undoubtedly have a place at the international tech table and could possibly take over. This is a big deal.
Patrick Roland, December 4, 2017
Google Maps Misses the Bus
December 4, 2017
Google Maps is the preferred GPS system for millions of people. It uses real-time information to report accidents and stay updated on road conditions. It is great when you are driving or walking around a city, but when it comes to public transportation, especially to the airports, Google ignores it. City Lab discusses, “Why Doesn’t Google Maps Know The Best Way To the Airport?”
Speaking from personal experience: on a recent trip to New York City, I had to get from Queens to LaGuardia Airport. Google Maps took me the most roundabout way possible instead of routing me to direct trains and buses. Google’s directions may have required fewer train changes, but they took me in the opposite direction of my destination.
Google Maps has a problem listing airport-specific transportation in its app, but it really should not be a problem.
As Google describes things, putting those city-to-terminal routes into its mapping apps shouldn’t be that hard. A transit operator has to apply to be listed in Google Transit, publish its schedule in the standard General Transit Feed Specification (GTFS) format, and have Google run some quality tests on that feed before factoring it into directions.
But some smaller transit operations don’t get to the first step. They don’t even know it’s an option.
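For a sense of what publishing a GTFS feed involves: a feed is a zip archive of comma-separated text files. The sketch below checks whether an archive contains the files the GTFS reference has traditionally treated as required. It is only an illustration; the function name missing_gtfs_files is hypothetical, and this is not a reproduction of the quality tests Google runs.

```python
import io
import zipfile

# Files a GTFS feed has traditionally been required to include.
CORE_FILES = {"agency.txt", "stops.txt", "routes.txt", "trips.txt", "stop_times.txt"}
# At least one of these must describe the service calendar.
CALENDAR_FILES = {"calendar.txt", "calendar_dates.txt"}


def missing_gtfs_files(feed):
    """Return the names of core GTFS files absent from a feed zip.

    `feed` may be a filesystem path or a file-like object.
    """
    with zipfile.ZipFile(feed) as archive:
        names = set(archive.namelist())
    missing = sorted(CORE_FILES - names)
    if not (CALENDAR_FILES & names):
        missing.append("calendar.txt or calendar_dates.txt")
    return missing


# Demo with an in-memory feed that lacks stop_times.txt and any calendar file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for name in ("agency.txt", "stops.txt", "routes.txt", "trips.txt"):
        archive.writestr(name, "")
buffer.seek(0)
print(missing_gtfs_files(buffer))
```

A small operator could run a check like this before applying, although a real feed also has to pass content-level validation that this sketch does not attempt.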
Transportation services may not know how to be added to Google, but Google has not reached out to them either. Historically, Google has only reached out to large transportation entities, because that meant more business on its end. Google also has a weird clause that transportation services need to sign before their information is added to Google Maps. It absolves Google of responsibility for “any defects in the data,” and it sounds like Google does not want to be held responsible for misinformation displayed on Google Maps.
Whitney Grace, December 4, 2017
The World’s Wealthiest People Should Fear Big Data
November 24, 2017
One of the strengths that the planet’s elite and wealthy have is secrecy. In most cases, average folks and the media don’t know where big money is stored or how it is acquired. However, that recently changed for the Queen of England, several Trump cabinet members, and other powerful men and women. And they should be afraid of what big data and search can do with their info, as we learned in the Guardian’s piece, “Paradise Papers Leak Reveals Secrets of the World’s Elite Hidden Wealth.”
The story found a lot of fishy dealings involving political donors and those in power, including Queen Elizabeth having tax-free money in the Caymans, and more. According to the story:
At the centre of the leak is Appleby, a law firm with outposts in Bermuda, the Cayman Islands, the British Virgin Islands, the Isle of Man, Jersey and Guernsey. In contrast to Mossack Fonseca, the discredited firm at the centre of last year’s Panama Papers investigation, Appleby prides itself on being a leading member of the “magic circle” of top-ranking offshore service providers.
Appleby says it has investigated all the allegations, and found “there is no evidence of any wrongdoing, either on the part of ourselves or our clients”, adding: “We are a law firm which advises clients on legitimate and lawful ways to conduct their business. We do not tolerate illegal behaviour.”
It makes you wonder what would happen if some of the brightest minds in search and big data got hold of this information. We suspect a lot of the financial knots this money ties to keep itself concealed would untangle. In an age of increasing transparency, we wouldn’t be shocked to see that happen.
Patrick Roland, November 24, 2017
Google Relevance: A Light Bulb Flickers
November 20, 2017
The Wall Street Journal published “Google Has Chosen an Answer for You. It’s Often Wrong” on November 17, 2017. The story is online, but you have to pay money to read it. I gave up on the WSJ’s online service years ago because at each renewal cycle, the WSJ kills my account. Pretty annoying because the pivot of the WSJ write up about Google implies that Google does not do information the way “real” news organizations do. Google does not annoy me the way “real” news outfits handle their online services.
For me, the WSJ is a collection of folks who find themselves looking at the exhaust pipes of the Google Hellcat. A source for a story like “Google Has Chosen an Answer for You. It’s Often Wrong” is a search engine optimization expert. Now that’s a source of relevance expertise! Other useful sources are the terse posts by Googlers authorized to write vapid, cheery comments in Google’s “official” blogs. The guts of Google’s technology are described in wonky technical papers, the background and claims sections of Google’s patent documents, and systematic queries run against Google’s multiple content indexes over time. A few random queries do not reveal the shape of the Googzilla in my experience. Toss in a lack of understanding about how Google’s algorithms work and their baked in biases, and you get a write up that slips on a banana peel of the imperative to generate advertising revenue.
I found the write up interesting for three reasons:
- Unusual topic. Real journalists rarely address the question of relevance in ad-supported online services from a solid knowledge base. But today everyone is an expert in search. Just ask any millennial, please. Jonathan Edwards had less conviction about his beliefs than a person skilled in locating a pizza joint on a Google Map.
- SEO is an authority. SEO (search engine optimization) experts have done more to undermine relevance in online search than any other group. The one exception is the teams who have to find ways to generate clicks from advertisers who want to shove money into the Google slot machine in the hopes of an online traffic pay day. Using SEO experts’ data as evidence grinds against my belief that old fashioned virtues like editorial policies, selectivity, comprehensive indexing, and a bear hug applied to precision and recall calculations are helpful when discussing relevance, accuracy, and provenance.
- You don’t know what you don’t know. The presentation of the problems of converting a query into a correct answer reminds me of the many discussions I have had over the years with search engine developers. Natural language processing is tricky. Don’t believe me? Grab your copy of Gramatica didactica del espanol and check out the “rules” for el complemento circunstancial. Online systems struggle with what seems obvious to a reasonably informed human, but toss in multiple languages for automated question answering, and “Houston, we have a problem” echoes.
I urge you to read the original WSJ article yourself. You decide how bad the situation is at ad-supported online search services, big time “real” news organizations, and among clueless users who believe that what’s online is, by golly, the truth dusted in accuracy and frosted with rightness.
Humans often take the path of least resistance; therefore, performing high school term paper research is a task left to an ad supported online search system. “Hey, the game is on, and I have to check my Facebook” takes precedence over analytic thought. But there is a free lunch, right?
In my opinion, this particular article fits in the category of dead tree media envy. I find it amusing that the WSJ is irritated that Google search results may not be relevant or accurate. There are 20 years of search evolution under Googzilla’s scales, gentle reader. The good old days of the juiced up CLEVER methods and Backrub’s old fashioned ideas about relevance are long gone.
I spoke with one of the earlier Googlers in 1999 at a now defunct (thank goodness) search engine conference. As I recall, that confident and young Google wizard told me in a supercilious way that truncation was “something Google would never do.”
What? Huh?
Guess what? Google introduced truncation because it was a required method to deliver features like classification of content. Mr. Page’s comment to me in 1999 and the subsequent embrace of truncation make clear that Google was willing to make changes to increase its ability to capture the clicks of users. Kicking truncation to the curb and then digging through the gutter trash told me two things: [a] Google could change its mind for the sake of expediency prior to its IPO and [b] Google could say one thing and happily do another.
I thought that Google would sail into accuracy and relevance storms almost 20 years ago. Today Googzilla may be facing its own Ice Age. Articles like the one in the WSJ are just belated harbingers of push back against a commercial company that now has to conform to “standards” for accuracy, comprehensiveness, and relevance.
Hey, Google sells ads. Algorithmic methods refined over the last two decades make that process slick and useful. Selling ads does not pivot on investing money in identifying valid sources and the provenance of “facts.” Not even the WSJ article probes too deeply into the SEO experts’ assertions and survey data.
I assume I should be pleased that the WSJ has finally realized that algorithms integrated with online advertising generate a number of problematic issues for those concerned with factual and verifiable responses.
Searx: Another Privacy Oriented Web Search System
November 13, 2017
There are a number of privacy oriented Web search systems. If you want to poke around, try the quirky Unbubble or give Gibiru a whirl. I noted another entrant called Searx. There are some important differences. Searx is a system which takes a page from peer to peer access systems. You host it yourself. The system is a metasearch engine like Ixquick (Startpage). This means that the user’s query is converted to the query syntax used by search systems like Bing.com. The results are merged and a results list displayed. Deduplication is a slippery fish. You will need to scan the results and run through the familiar, but much maligned procedure of scan, click, browse, and save the Web page with the information you want. If you are like a millennial, you will take the first result because everything on the Web is true.
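Why deduplication is slippery becomes clearer with a sketch of the merge step in a metasearch system. The code below is an illustration, not Searx’s actual implementation: it assumes each upstream engine returns an ordered list of URLs, and the normalization rules and reciprocal-rank scoring in the hypothetical functions normalize and merge_results are assumptions chosen for brevity.

```python
from urllib.parse import urlsplit


def normalize(url):
    # Assumed rules: ignore scheme, a leading "www.", a trailing slash,
    # and any fragment. Real systems need many more special cases.
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    return key + "?" + parts.query if parts.query else key


def merge_results(result_lists):
    """Merge ranked URL lists; a URL scores the sum of 1/rank across engines."""
    scores, first_seen = {}, {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            key = normalize(url)
            scores[key] = scores.get(key, 0.0) + 1.0 / rank
            first_seen.setdefault(key, url)  # keep the first spelling seen
    ordered = sorted(scores, key=lambda k: -scores[k])
    return [first_seen[k] for k in ordered]


# Hypothetical result lists from two engines.
engine_a = ["https://example.com/page/", "https://example.org/a"]
engine_b = ["http://www.example.com/page", "https://example.net/b"]
for url in merge_results([engine_a, engine_b]):
    print(url)
```

Two spellings of the same page collapse into one entry here, but mirrored content, tracking parameters, and redirects would slip straight through, which is why the user still ends up scanning the list by hand.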
Stephen E Arnold, November 13, 2017
Ichidan Simplifies Dark Web Searches
November 10, 2017
Now there is an easier way to search the Dark Web, we learn from a write-up at Cylance, “Ichidan, a Search Engine for the Dark Web.” Cybersecurity pro and writer Kim Crawley informs us:
Ichidan is a search engine for looking up websites that are hosted through the Tor network, which may be the first time that’s been done at this scale. Websites on Tor usually have the .onion top level domain and you typically need a web browser with the Tor plugin or Tor’s own configured web browser in order to access them. … The search engine is less like Google and more like Shodan, in that it allows users to see technical information about .onion websites, including their connected network interfaces, such as TCP/IP ports.
Researchers at BleepingComputer explored the possibilities of this search engine. They were able to reproduce OnionScan’s findings on the shrinkage of the Dark Web: the number of Dark Web services decreased from about 30,000 in April 2016 to about 4,400 not quite a year later (so by about 85%). Researchers found this alarming capability, too:
BleepingComputer was also able to use Ichidan to find a website with a lot of exposed ports, including OpenSSH, an email server, a Telnet implementation, vsftpd, and an exposed Fritzbox router. That sort of information is very attractive to cyber attackers. Using Ichidan is a lot easier than command line pentesting tools, which require more specific technical know-how.
Uh-oh. Crawley predicts that use of Ichidan will grow as folks on both sides of the law discover its possibilities. She advises anyone administering a .onion site to strengthen their cyber defenses posthaste, “if they want to survive.”
Cynthia Murrell, November 10, 2017
Reddit Search Improves with Lucidworks
November 10, 2017
YouTube might swallow all of your free time with videos, but Reddit steals your entire life with videos, plus images, GIFs, posts, jokes, and cute pictures of doggos, danger noodles, trash pandas, and floofs. If you do not know what those are, then shame on you. If you are a redditor, then you might have noticed that the search function stinks worse than a troll face. According to the TechCrunch article “Reddit Teams With Lucidworks To Build New Search Framework,” Reddit has finally given its search function a facelift.
Reddit has some serious stats when it comes to user searches and postings. The online discussion platform has more than 500 million users, generates 5 million comments, and handles 40 million searches each day. While one of Reddit’s search challenges is dealing with the varied content, another is returning personalized search results without redditors having to explicitly write them in the search box.
Reddit’s poor search performance is legendary and its head honchos wanted to improve it, but trying to find the time to fix it was a problem. That is why they hired Lucidworks to do the job for them:
Caldwell said that the company went with the Lucidworks Fusion platform because it had the right combination of technology and the ability to augment his engineering team, while helping search to continually evolve on Reddit. Buying a tool was only part of the solution though. Reddit also needed to hire a group of engineers with what Caldwell called “world class search and relevance engineering expertise.” To that end, he has set up a 30-person engineering search team devoted to maximizing the potential of the new search platform.
Lucidworks currently remains in charge of fixing Reddit’s search issues, but eventually, Reddit will take over. Searches for danger noodle, floof, and doggo not only return more accurate results, but you can also learn the aww lingo through them.
Whitney Grace, November 10, 2017
Treating Google Knowledge Panel as Your Own Home Page
November 8, 2017
Now, this is interesting. Mike Ramsey at AttorneyAtWork suggests fellow lawyers leverage certain Google tools in, “Three Reasons Google Is Your New Home Page.” He points out that Google now lets users reach many entities directly from the results page, reducing the number of clicks potential clients must perform to contact a firm. He writes:
[Google] has rolled out three products that provide potential clients with information about your law firm before they get to your site:
- Messages (on mobile)
- Questions and Answers (on mobile)
- Optional URLs for booking appointments (both mobile and desktop)
This means that Google search results are becoming your new ‘home page.’ All three products — Messages, Questions and Answers and URLs for appointments — are accessible from your Google My Business dashboard. They appear in your local Knowledge Panel in Google. If Google really is becoming your home page, but also giving you a say in providing potential clients with information about your firm, you will definitely want to take advantage of it.
The article explains how to best leverage each tool. For example, Messages let you incorporate text messages into your Knowledge Panel; Ramsey notes that customers prefer using text messages to resolve customer service issues. Questions and Answers will build an FAQ-like dialogue for the panel, while optional URLs allow clients to schedule appointments right from the results page. Ramsey predicts it should take about an hour to set up these tools for any given law firm, and emphasizes it is well worth that investment to make it as easy as possible for potential clients to get in touch.
Cynthia Murrell, November 8, 2017
The Power of Search: Forget Precision, Recall, and Accuracy of the Items in the Results List
November 3, 2017
Thank you, search engine optimization. I now have incontrovertible proof that search which is useful to the user is irrelevant. Maybe dead? Maybe buried?
Navigate to “70 SEO Statistics That Prove the Power of Search.” Prepare to be amazed. If you actually know about precision and recall, you will find that those methods for evaluating the efficacy of a search system belong in the grave.
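For anyone who has not met the two measures: precision is the share of returned items that are relevant, and recall is the share of relevant items that were returned. A minimal sketch follows; the function name precision_recall and the document identifiers are hypothetical, made up for illustration only.

```python
def precision_recall(retrieved, relevant):
    # Treat both inputs as sets of document identifiers.
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: the system returns four documents, and two of them
# are among the five documents a human judged relevant.
p, r = precision_recall(
    ["d1", "d2", "d3", "d4"],
    ["d1", "d3", "d7", "d8", "d9"],
)
print(f"precision={p:.2f} recall={r:.2f}")
```

Neither number says anything about clicks, rankings, or leads, which is rather the point.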
The “power of search” is measured by statistics presented without silliness like sample size, date, confidence level, etc. Who needs these artifacts from Statistics 101?
Let’s look at five of the 70 statistics. Please, consult the original for the full listing which proves the power of search. I like that “proves” angle too.
First, users don’t do much research. Here’s the statistic which proves the assertion “Online users just take what the system serves up”:
75% of users never click past the first page of search results.
So if you, your product, your company, or your “fake news” item does not appear at the top of a search result list or an output determined by a black box algorithm, you, your product, your company, or your “fake news” item does not exist. How’s that grab you?
Second, users are not too swift when it comes to figuring out what’s content and what’s an ad. Amazing assertion, right?
55% of searchers don’t know which links in the Search Engine Results pages are PPC ads, according to a new survey. And up to 50% of users shown a Search engine Results page screenshot could not identify paid ads.
If one can’t figure out what’s an ad, how many users can figure out whether a statistic, like those which prove search is powerful, is accurate information or hogwash?
Third, search results mean trust. Sound crazy to you? No. Well, it sure does to me. Here’s the statistic that proves search eats Wheaties:
88% of consumers trust online reviews as much as they trust personal recommendations.
I believe everything I read on the Internet, don’t you?
Fourth, if you blog, prepare to be inundated with sales calls and maybe money. Here’s the statistic which proves that search has power:
Companies who blog have 434% more indexed pages than those who don’t. That means more leads!
I would suggest that if your company engages in hate speech or certain product sales, or violates terms of use, you will have to chase customers on the Dark Web or via i2p. By the way, I think a company is a thing, so “which” not “who” seems more appropriate. Don’t y’all agree?
Fifth, using pictures is a good thing. Hey, who has time to read? This statistic conflicts with “longer articles are better” but I get the picture:
The Backlinko study also reported that using a single image within content will increase search engine rankings.
Here’s a picture to make this write up more compelling:
Search has power. Really?
Stephen E Arnold, November 3, 2017
Yet Another Way to Make Search Smarter
November 3, 2017
Companies are always inventing new ways to improve search. Their upgrades are always guaranteed to do this or that, but usually they do nothing at all. BA Insight is one of the few companies that offers a decent search product, and guess what? They have a new upgrade! They announced it on their blog in “BA Insight Makes Search Smarter With Smarthub.” BA Insight’s latest offering is called Smarthub, and it is designed specifically for cognitive search. It leverages cloud-based search and cognitive computing services from Google, Elastic, and Microsoft.
Did I mention it was an app? Most of them are these days. Smarthub also supports and is compatible with other technology, has search controls built from metadata, machine learning personalization analytics, cognitive image processing, and simultaneous access to content from over sixty enterprise systems. What exactly is cognitive search?
‘Cognitive search, and indeed, the entire new wave of cognitive applications, are the next leap forward in information access. These apps rest on a search backbone that integrates information, making it findable and usable. Companies such as BA Insight are now able to not only provide better search results, but also uncover patterns and solve problems that traditional search engines can’t,’ said Sue Feldman, Co-Founder and Managing Director at the Cognitive Computing Consortium. ‘There’s a cognitive technology race going on between the big software superpowers, which are developing platforms on which these applications are built. Smart smaller vendors go the next mile, layering highly integrated, well designed, purpose-built applications on top of multiple platforms so that enterprises can leave their information environments in place while adding in the AI, machine learning, and language understanding that gets them greater, faster insights.’
It sounds like what all search applications are supposed to do. I guess it is just a smarter version of the search applications that already exist, but what makes it different is the analytics and machine learning components, which make information more findable and personalize the experience.
Whitney Grace, November 3, 2017