Funding Granted for American Archive Search Project

September 23, 2015

Here’s an interesting project: we received an announcement about funding for Pop Up Archive: Search Your Sound. A joint effort of the WGBH Educational Foundation and the American Archive of Public Broadcasting, the venture’s goal is nothing less than to make almost 40,000 hours of Public Broadcasting media content easily accessible. The American Archive, now under the care of WGBH and the Library of Congress, has digitized that wealth of sound and video. Now, the details are in the metadata. The announcement reveals:

As we’ve written before, metadata creation for media at scale benefits from both machine analysis and human correction. Pop Up Archive and WGBH are combining forces to do just that. Innovative features of the project include:

*Speech-to-text and audio analysis tools to transcribe and analyze almost 40,000 hours of digital audio from the American Archive of Public Broadcasting

*Open source web-based tools to improve transcripts and descriptive data by engaging the public in a crowdsourced, participatory cataloging project

*Creating and distributing data sets to provide a public database of audiovisual metadata for use by other projects.

“In addition to Pop Up Archive’s machine transcripts and automatic entity extraction (tagging), we’ll be conducting research in partnership with the HiPSTAS center at University of Texas at Austin to identify characteristics in audio beyond the words themselves. That could include emotional reactions like laughter and crying, speaker identities, and transitions between moods or segments.”

The project just received almost $900,000 in funding from the Institute of Museum and Library Services. This loot is on top of the grant received in 2013, from the Corporation for Public Broadcasting, that got the project started. But will it be enough money to develop a system that delivers on-point results? If not, we may be stuck with something clunky, something that resembles the old Autonomy Virage, Blinkx, Exalead video search, or Google YouTube search. Let us hope this worthy endeavor continues to attract funding so that, someday, anyone can reliably (and intuitively) find valuable Public Broadcasting content.

Cynthia Murrell, September 23, 2015

Sponsored by, publisher of the CyberOSINT monograph

A Search Engine for College Students Purchasing Textbooks

August 27, 2015

The article on Lifehacker titled TUN’s Textbook Search Engine Compares Prices from Thousands of Sellers reviews TUN, or the “Textbook Save Engine.” College students know all too well that tuition and fees are only the beginning of their expenses. Textbook costs alone can skyrocket for students who have no choice but to buy the assigned books if they want to pass their classes. TUN shows students the options available from thousands of booksellers. The article says,

“The “Textbook Save Engine” can search by ISBN, author, or title, and you can even use the service to sell textbooks as well. According to the main search page…students who have used the service have saved over 80% on average buying textbooks. That’s a lot of savings when you normally have to spend hundreds of dollars on books every semester… TUN’s textbook search engine even scours other sites for finding and buying cheap textbooks; like Amazon, Chegg, and Abe Books.”

After typing in the book title, you get a list of editions. For example, when I entered Pride and Prejudice, which I had to read for two separate English courses, TUN listed an annotated version, several versions with different forewords (which are occasionally studied in the classroom as well) and Pride and Prejudice and Zombies. After you select an edition, you are brought to the results, laid out with shipping and total prices. It is a handy tool for students who leave themselves enough time to order their books before class begins.

Chelsea Kerwin, August 27, 2015

Library Design Improves

June 10, 2015

I like libraries. If you enjoy visiting them as well, navigate to “These Modern Libraries Look Like Alien Spaceships On The Inside.” Among the libraries featured are the Beinecke Rare Book and Manuscript Library (Yale), Bibliotheca Alexandrina, and Biblioteca España.

Stephen E Arnold, June 9, 2015

Reading in the Attention Deficit World

May 12, 2015

The article on Popist titled Telling the Truth with Charts outlines the simplest and most effective method of presenting information on the waning of book-reading among Americans. While the article focuses on the effectiveness of the chart, the information in the chart is disturbing as well: the share of Americans who read zero books in 2014 is up to 23% from 8% in 1987. The article links to another article on The Atlantic titled The Decline of the American Book Lover. That article presents an argument for some hope:

“The percentage of young folks reading for pleasure stopped declining. Last year, the NEA found that 52 percent of 18-24 year-olds had read a book outside of work or school, the same as in the pre-Facebook days of 2002. If book culture were in terminal decline, this is the demographic where you’d expect it to be fading fastest. Perhaps the worst of the fall is over.”

The article demonstrates the connection between education level and reading for pleasure, which may be validation for many teachers and professors. However, there also seems to be a growing tendency among students to read, even homework, without absorbing anything, or in other words, to skim texts instead of paying close attention. This may be the effect of too much TV or Facebook, or even the No Child Left Behind generation entering college. Students are far more interested in their grades than in their education, and just tallying up the numbers of books they or anyone else read is not going to paint an accurate portrait. Similarly, what books are the readers reading? If they are all Twilight and 50 Shades of Grey, do we still celebrate the accomplishment?

Chelsea Kerwin, May 12, 2015

Research Like the Old School

April 24, 2015

There was a time before the Internet when, if you wanted to research something, you had to go to the library, dig through old archives, and check encyclopedias for quick facts. It now seems that all information is at your disposal with a few keystrokes, but search results are often polluted with paid ads, and unless your information comes from a trusted source, you can’t count it as fact.

Lifehacker, like many of us, knows that if you want to get the truth behind a topic, you have to do some old school sleuthing. The article “How To Research Like A Journalist When The Internet Doesn’t Deliver” drills down into tried-and-true research methods that will continue to withstand the sands of time or the wrecking ball (depending on how long libraries remain brick and mortar buildings).

The article pushes using librarians as resources and even goes as far as recommending petitioning government agencies and filing FOIA requests for information. Its claim that some information is only available in person or strictly to other librarians is both true and false. Many libraries are trying to digitize their holdings, but tight budgets limit their resources. Also, unless the librarian works in a top secret archive, most of the information is readily available to anyone, with or without an MLS degree.

Old school interviews are always great, especially when you have to cite a source. You can always cite your own interview and verify it came straight from the horse’s mouth. One useful way to team the Internet with interviews is tracking down the interviewees.

Lastly, this is the best piece of advice from the article:

“Finally, once you’ve done all of this digging, visited government agencies, libraries, and the offices of the people with the knowledge you need, don’t lose it. Archive everything. Digitize those notes and the recordings of your interviews. Make copies of any material you’ve gotten your hands on, then scan them and archive them safely.”

The Internet is full of false information. Verifying what you find lends it a little more credence and makes the information safer to use or to claim as the truth.

These tips are useful, even if a little obvious, but they still fail to mention the important step that all librarians know: doing the actual footwork and using proper search methods to find things.

Whitney Grace, April 24, 2015

Worrying about Losing Obsolete Information

March 9, 2015

Ready to hear another side to the endangered library argument that has been tossed around since the 1990s? Hopes and Fears revives worries about losing data stored on obsolete media and describes how libraries are evolving rather than disappearing in “The Near And Far Future Of Libraries.” The article points out the same old fear that some obsolete media have not yet been moved to a digital archive and might be forgotten. It also mentions that libraries are transforming their spaces into gathering places for people to study, read, and meet (like that is new).

Mixed in with the fear of disappearing libraries, the article discusses new ways that artificial intelligence is helping to preserve knowledge and helping people learn how to harness their information. It offers some new insights about how libraries are changing, but the bulk of the article is disorganized and hard to tie together.

One valid idea is that centralizing too much information on Web sites like Wikipedia, social media networks, and even the Internet Archive is dangerous, because one Web site is easier to block than hundreds. Another important point is that more interactive technology tools are actually helping people better use their information. Robots like Vincent and Nancy from Westport Library are an example of how people can better physically interact with information and use it to their advantage.

The most interesting archival idea presented is the Rosetta Disk, a thin nickel disk three inches in diameter that holds over 14,000 pages of information. While it is meant to preserve knowledge for ready access in the future, it is also a good backup:

“We aren’t creating the Rosetta Disk specifically with an apocalypse in mind, or for a society that’s undergoing major upheaval, but over the span of millennia, I think you have to expect that to happen occasionally. In that case, the Rosetta Disk is a good long-term backup. You might think of it as a “secret decoder ring” for information we leave for the future in human language form.”

Libraries and information are changing. We do have to preserve obsolete knowledge before it degrades, and we have to upgrade libraries for them to remain relevant. The situation is very similar to that of old historical sites with low visitor attendance. They are changing the way they interact with people and present their historical information to draw people to them. Do not be fearful; embrace the change.

Whitney Grace, March 09, 2015
Sponsored by, developer of Augmentext

Early English Texts Now Available Online

February 16, 2015

The phrase “early English literature” encompasses texts written from the mid-fifteenth century to 1700. Now, the University of Oxford’s Bodleian Libraries tells us about its exciting project to make such works available to anyone with Internet access in “Thousands of Early English Books Released Online to Public by Bodleian Libraries and Partners.” The University of Michigan Library is also involved in the project, which will release some 25,000 texts. The fully searchable files can be downloaded in different formats or read online.

The works were compiled some time ago by the Early English Books Online Text Creation Partnership (EEBO-TCP), which spent 15 years manually entering and XML-encoding the texts. The results were made available to users of academic libraries at the time, but were released into the public domain at the turn of the new year. The post informs us:

“Members of the public, teachers and researchers around the world can now have access to thousands of transcriptions of English texts published during the first two centuries of printing in England. The corpus includes important works by literary giants like Chaucer and Bacon, but also contains many rare and little-known materials that were previously only available to those with access to special collections at academic libraries.

“The text-only files are a unique resource for members of the public to browse for curious and interesting topics and titles ranging from witchcraft and homeopathy to poetry and recipes. In addition to browsing and reading text-only versions of these early English books, users of EEBO-TCP can also search the entire corpus, which contains more than two million pages and nearly a billion words. The text has been encoded with Extensible Markup Language (XML), allowing individuals to search for keywords and themes across the entire collection of works, in individual books or even within specific sections of text such as stage directions or tables of contents.”

Michael Popham, head of the Bodleian Libraries’ digital collections, is excited about the full-search functionality. He expects the tool will allow users to make connections, cross-references, and discoveries as never before.
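
For readers curious how XML encoding makes a "search only the stage directions" query possible, here is a minimal sketch in Python. It is not the EEBO-TCP search tool. The sample fragment, the `search_sections` helper, and the assumption that stage directions sit in a `stage` element (as in TEI-style markup) are ours for illustration; real corpus files are far larger and may use XML namespaces that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-made fragment in a TEI-like style (not taken from the corpus).
SAMPLE = """<text>
  <body>
    <div type="scene">
      <stage>Enter a Witch, with a cauldron.</stage>
      <sp><speaker>Witch</speaker><l>Stir the pot and mind the fire.</l></sp>
      <stage>Exit.</stage>
    </div>
  </body>
</text>"""


def search_sections(xml_text, tag, keyword):
    """Return the text of every <tag> element that contains keyword (case-insensitive)."""
    root = ET.fromstring(xml_text)
    needle = keyword.lower()
    hits = []
    for element in root.iter(tag):
        # Join nested text nodes and collapse runs of whitespace.
        text = " ".join("".join(element.itertext()).split())
        if needle in text.lower():
            hits.append(text)
    return hits


# Search only inside stage directions, ignoring speeches and speaker labels.
print(search_sections(SAMPLE, "stage", "witch"))
```

Because the markup labels each kind of section, the same keyword can be looked up in stage directions, verse lines, or speaker names simply by changing the tag passed in.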

Cynthia Murrell, February 16, 2015

Enterprise Search: Evidence It Is a Commodity

January 17, 2015

I was browsing through some information gathered by Overflight last week. I came across an interesting page showing Libraries Australia Architecture Overview. Here’s a miniature of the diagram. The link provides a larger version. Where is search? Well, it is in the middle, represented by a purple storage icon.


The search system is Solr. I find this interesting for several reasons:

First, Solr replaced the Australian-developed TeraText search system, which I think is pretty good. TeraText was a commercial product, and Solr is an open source system.

Second, Solr is a component in a far larger system. No surprise here, but the diagram makes clear that search is a utility supporting many other library functions. For vendors who make search the fabric for a large-scale application, the Libraries Australia team may want you to give them a lecture about ways to improve their system.

Third, Libraries Australia has a number of systems, each of which presumably has its native search tools. The implication is that Solr provides one-screen access to these diverse resources. I wonder if the Oracle DBA uses Solr instead of the native Oracle tools. My thought is that the Solr champions see no reason to fool with Oracle command lines. The DBA, on the other hand, may see information access from a different point of view.

Net net: A commercial account closes, and an open source account begins. Does this fact suggest that closing deals for proprietary search systems might be more difficult in 2015?

Stephen E Arnold, January 17, 2015

EU Decision on Digitization by Library

October 8, 2014

The European Court of Justice recently issued an interesting ruling. Intellectual Property Watch reports that “Libraries May Be Permitted to Digitise Books Without Copyright Owner’s Consent, EU High Court Rules.” The decision says libraries may digitize works to make them available at electronic reading stations, but draws the line at printing them out or copying them to a USB drive. The precipitating event seems to have occurred in Germany, where a university library refused to purchase an e-book and, instead, chose to place the book in its computer system without the publisher’s consent.

The EU copyright directive issued in 2001 does carve out an exception for libraries to make content available electronically. Unsure whether the above usage is covered by the exception, the German Federal Court of Justice consulted the EU court for clarification. Writer Dugie Standeford explains:

“The ECJ held that even if a rights owner offers a library a licence agreement for use of the work on appropriate terms, the library may take advantage of the exception, since it otherwise could not fulfil its core mission of promoting research and private study. The directive doesn’t bar governments from giving libraries the right to digitise books, and, if necessary, from making the material available on dedicated computers, the court said.

“But the right of communication which public libraries may hold doesn’t allow people to print out the works on paper or store them on USB sticks, because those are acts of reproduction which aim to create a new copy of the digital copy, the court said. Nevertheless, it added, member states may provide an exception or limitation that allows library users to print the works or store them on a USB stick, so long as compensation is paid to the rightsholder.”

Concerns have been raised about how compensation might be collected for such copies. On the other hand, the decision has been hailed as a win for libraries and archives. The question of online access is not specifically mentioned in the ruling, but it does limit the exception to “dedicated terminals.” As remote access to information becomes more and more standard, we may have another clarification to look forward to.

Cynthia Murrell, October 08, 2014

Good-Bye Court Documents

September 22, 2014

The Internet makes it easier to access information, including documents from the government. While accessing government documents might cost a few cents, it is amazing that the information can be accessed within a few mouse clicks. BoingBoing, run by the infamous Cory Doctorow, notes that five important US courts are removing their documents from the Internet in “As Office Of US Courts Withdraws Records For Five Top Benches, Can We Make Them Open?”

The court documents are housed on the PACER system, most notable for charging users ten cents a page to access information. Doctorow advocates for free information and for stopping governments from spying on their citizens. It is not surprising that he supports reopening these documents, along with the Free Law Project, Internet Archive, and Public.Resource.Org.

The plea reads:

“Our judiciary is based on the idea that we conduct justice public, not in star chambers and smoke-filled back rooms. Our system of justice is based on access to the workings of our courts, and when you hide those workings behind a pay wall, you have imposed a poll tax on access to justice. Aaron [Swartz] and many others believed very deeply in this principle and we will continue to fight for access to justice, equal protection, and due process. These are not radical ideas and the Administrative Office of the U.S. Courts should join us in our commitment.”

Swartz was known for working against Internet censorship bills, so rallying Doctorow and the others to his cause should attract the right backers to make these documents available again. You can fight city hall and win, especially if you are a technology enthusiast with legal aid.

Whitney Grace, September 22, 2014