Friday, October 23, 2015

6 common misconceptions when doing advanced Google Searching

As librarians we are often called upon to teach not just library databases but also Google and Google Scholar.

Teaching Google is often trickier than teaching other search tools. With library databases we can get insider access through our friendly product support representative, but as librarians we have no more and no less insight than anyone else into Google, which is legendary for being secretive.

Still, given that Google has become synonymous with search we should be decently good at teaching it.

I've noticed though, often when people teach Google, particularly advanced searching of Google, they fall prey to 2 main types of errors.

The first type of error involves not keeping up to date. Given how rapidly Google changes, we often end up teaching things that no longer work.

The second type of error is perhaps more common to us librarians. We often carry over the usual methods and assumptions from Library databases expecting them to work in Google when sadly they don't.

It is very difficult to detect both types of errors because Google seems to be designed to fail gracefully. For example, it may silently ignore symbols you add that don't work.

Also, the typical Google search brings back only an estimated count of results (e.g. "about" X million), so it's hard to see if your search worked as expected.

As I write this blog post in Oct 2015, what follows is some of the common errors and misconceptions I've seen about searching in Google while doing research on the topic. Some of the misconceptions I knew about, a few surprised me. Of course by the time you read this post,  a lot is likely to be obsolete!

The 6 are

  • Using deprecated operators like tilde (~) and plus (+) in search strings
  • Believing that all terms in the search string will definitely be included (in some form)
  • Using AND in search strings works
  • Using NOT in search strings works
  • Using asterisk (*) as a character wildcard or truncation in search strings works
  • Using parentheses ( ) in search strings to control order of operators works

1. Using deprecated operators like tilde (~) and plus (+) in search strings

As of writing, this is the list of operators supported by Google; anything else is probably not supported. So if you are teaching people to use the tilde (~) or plus (+) operator, please stop.

About tilde (~)

Karen Blakeman explains here what it used to do.

"Although Google automatically looks for variations on your terms, placing a tilde before a word seemed to look for more variations and related terms. It meant that you didn’t have to think of all the possible permutations of a word. It was also very useful if you wanted Google to run your search exactly as you had typed it in except for one or two words.

The Verbatim option tells Google to run your search without dropping terms or looking for synonyms, but sometimes you might want variations on just one of the words. That was easily fixed by placing a tilde before the word"

However as of June 2013 tilde (~) no longer works. (See official explanation).

About plus operator (+)

Another discontinued operator often still taught is the plus (+) Operator.

The plus operator used to force Google to match the search terms exactly as you typed them. In other words, "It turned off synonymization and spell-correction". So for example if you searched +library , it would match library exactly and wouldn't substitute libraries or librarians for it.

However as of Oct 2011, it no longer works. (See official explanation)

According to the Google help page, the plus operator is now used for Google+ pages or blood types! (Google can generally also see a plus at the end of a term, e.g. C++.)

If you want to force exact keywords, you should add quotes around even single words, e.g. "library"
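If you have old search strings lying around in handouts, the two retired operators can be rewritten mechanically. Below is a minimal Python sketch of that idea. The function name is hypothetical, and the rules simply encode what is described above (quotes replace a leading plus, a leading tilde is dropped); it is not anything Google provides.

```python
import re

def modernize_legacy_operators(query: str) -> str:
    """Rewrite the retired + and ~ prefixes in a search string (illustrative only)."""
    # +term -> "term": quotes around a single word now do what + used to do.
    # The lookbehind only matches a plus at the start of a word, so C++ is untouched.
    query = re.sub(r'(?<!\S)\+(\w+)', r'"\1"', query)
    # ~term -> term: the tilde no longer does anything, so just drop it.
    query = re.sub(r'(?<!\S)~(\w+)', r'\1', query)
    return query

print(modernize_legacy_operators("+library ~systems C++"))
# prints: "library" systems C++
```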

Of course we librarians know double quotes also have another purpose, they force words to be in an exact phrase say "library systems" . This works in Google as per normal.

Interestingly enough, in the latest Google Power Searching course (September 2015), Daniel Russell mentions that you can nest quotes within quotes to combine phrase searching with an exact search on a single word.

For example he recommends searching "daniel "russell" " (note the nested quotes) because "daniel russell" alone gets him results with Daniel Russel (note only one 'L')

Another option, if you want results as near as possible to what you typed, is to use verbatim mode (which is kind of like the + operator, but applied to everything typed).


As noted in the video above, even in that mode, the order of operations is not enforced, so you should use double quotes on top of verbatim mode for further control.

I believe even verbatim mode or using quotes around single words doesn't absolutely stop Google from occasionally "helping" by dropping search terms if including those search terms causes too many results to disappear - sometimes called  "Soft AND", more about that next.

2. Believing that all terms in the search string will definitely be included (in some form)

I've mentioned this before, but Google practices what some call a "Soft AND": it will usually include all terms searched, but occasionally one of the search terms will be dropped.

In the above Power Searching Video, Daniel explains that when you search for term1 term2 term3 you might find some pages with only term1 term2 but not term3. He states that some pages rank so highly on just term1 and term2 that Google will drop term3.

What's the solution? He recommends using the intext operator. So for example term1 term2 intext:term3 , where the intext operator will force term3 to be on the page.

Note you can do phrase search together with intext as well, eg. intext:"library technology"
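As a rough illustration, here is a small Python helper that assembles such a search string. The function and parameter names are made up for this sketch; the only Google syntax it relies on is intext: and double quotes, as described above.

```python
def build_query(terms, must_appear=()):
    """Join search terms, prefixing the ones that must be on the page with intext:."""
    parts = []
    for term in terms:
        # Quote multi-word phrases so they are treated as one unit
        value = f'"{term}"' if " " in term else term
        if term in must_appear:
            value = "intext:" + value
        parts.append(value)
    return " ".join(parts)

print(build_query(["term1", "term2", "term3"], must_appear={"term3"}))
# prints: term1 term2 intext:term3
print(build_query(["library technology"], must_appear={"library technology"}))
# prints: intext:"library technology"
```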

3.  Using AND in search strings

Believe it or not, Google does not explicitly support the AND operator in search.

For example, neither the official Google help nor the official Google power searching course mentions the AND operator!

Let me be clear, of course if you do something like library systems  , Google will do an implicit AND and combine the terms together (subject to the issue stated above).

But what I am saying is you shouldn't type something like library AND systems (whether AND, and, AnD, aNd etc) because at best it is ignored because it is too common (a stop word), though occasionally it may actually just search and match the word AND like a normal term!

To avoid such issues just drop the AND and do library systems

As an aside, OR works as per normal, and the power searching course states it's the only case sensitive operator.

4. Using NOT in search strings

Many of us librarians are too used to literally typing NOT to exclude results. So for example we will automatically do libraries NOT systems , not knowing this fails.

What you should do to exclude terms is, of course, use the minus (-) operator. For example, try libraries -systems
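To make the last two points concrete, here is a minimal Python sketch that converts a database-style Boolean string into the form recommended above. It is only an illustration under stated assumptions: the function name is hypothetical, only uppercase AND and NOT are treated as operators (a lowercase "and" might be a real search word), and OR is left alone because it works as per normal.

```python
def to_google_syntax(query: str) -> str:
    """Drop explicit AND and turn 'NOT term' into '-term' (illustrative only)."""
    out = []
    negate_next = False
    for token in query.split():
        if token == "AND":
            continue  # Google already combines terms with an implicit AND
        if token == "NOT":
            negate_next = True  # the next term should be excluded
            continue
        out.append("-" + token if negate_next else token)
        negate_next = False
    return " ".join(out)

print(to_google_syntax("libraries NOT systems"))  # prints: libraries -systems
print(to_google_syntax("library AND systems"))    # prints: library systems
```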

5. Using asterisk (*) as a character wildcard or truncation in search strings

Another thing that doesn't work is using * after a string of letters to find variant forms of a search term.

For example, the following doesn't work: organ*

I believe Google automatically decides on stemming already so you don't need to do this to find words with the root of organ.

What works is something entirely different like this

a * saved is a * earned

The official guide says * is used as "a placeholder for any unknown or wildcard terms" , so you can match things like a penny saved is a penny earned where * can stand for 1 or more words.
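One way to picture the difference is with a regular expression. The Python sketch below treats a standalone * as "one or more whole words", which is my reading of the behaviour described above, not a description of how Google actually implements it. The function name is hypothetical.

```python
import re

def star_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a Google-style whole-word * pattern into a regular expression."""
    parts = []
    for token in pattern.split():
        if token == "*":
            parts.append(r"\w+(?:\s+\w+)*")  # stands in for one or more whole words
        else:
            parts.append(re.escape(token))   # everything else is matched literally
    return re.compile(r"\s+".join(parts), re.IGNORECASE)

proverb = star_pattern_to_regex("a * saved is a * earned")
print(bool(proverb.fullmatch("a penny saved is a penny earned")))         # True
# A * glued to letters is not a standalone wildcard in this model,
# so organ* does not expand to organic, organs and so on.
print(bool(star_pattern_to_regex("organ*").fullmatch("organic")))         # False
```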

But see the bonus tips in section 7 for how it interacts with the site: operator.

6. Using parentheses ( ) in search strings to control order of operators

This one is perhaps most shocking if you are unaware. When we combine AND with OR operators, a common question to ponder is, which operator has precedence?

My testing with various library databases shows that there is no one standard, some databases favour OR first others favour AND .

So it is a favourite trick of librarians to cut through the complication and just use parentheses, to avoid having to memorise how precedence works in different databases.

So we love to do things like

(library AND technology) OR systems

First off, we already said in #3 that you shouldn't use AND in the search, so let's try

(library technology) OR systems

But I am sorry to inform you that doesn't work either. In fact, the parentheses are ignored, and what Google actually sees is

library technology OR systems

Don't believe me? See here, here and here.

On Quora , a Google software engineer (search quality) says this

So what happens when you do something like library technology OR systems ?
In fact it's the equivalent of a library database search with library AND (technology OR systems)

It looks to me that OR has precedence which makes more sense to me than the other way around.
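The grouping described above can be mimicked in a few lines of Python. This is a sketch of the behaviour as I understand it (parentheses ignored, OR binding only the terms immediately on either side of it), not of Google's real parser. Both function names are hypothetical, and quoted phrases are not handled.

```python
def group_like_google(query: str):
    """Group terms the way the post describes: parentheses dropped, OR binds tightest."""
    tokens = query.replace("(", " ").replace(")", " ").split()
    groups = []
    join_next = False
    for token in tokens:
        if token == "OR":
            join_next = True  # the next term joins the previous group
            continue
        if join_next and groups:
            groups[-1].append(token)
        else:
            groups.append([token])
        join_next = False
    return groups

def describe(groups) -> str:
    """Write the groups out in explicit library-database style."""
    parts = [g[0] if len(g) == 1 else "(" + " OR ".join(g) + ")" for g in groups]
    return " AND ".join(parts)

print(describe(group_like_google("(library technology) OR systems")))
# prints: library AND (technology OR systems)
print(describe(group_like_google("(a b) OR (x y)")))
# prints: a AND (b OR x) AND y
```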

So what happens if you want (a b) OR (x y) ? Typing that out won't work in Google since it actually gives you a AND (b OR x) AND y, but here's a complicated untested idea.

7. Bonus tips

Around operator

There is a semi-official operator known as the AROUND function. It allows you to match words that are within X words of each other. This seems to be the same as a proximity operator without order.

So for example you can do

"library technology" AROUND(9) "social"

As noted by Dan Russell , AROUND needs to be in caps. For more details.
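If "proximity operator without order" is unfamiliar, the Python sketch below shows the idea on a plain string. It assumes that "within X words" means at most X words sitting between the two phrases; how Google counts exactly is not documented here, so treat the counting rule and the function name as assumptions.

```python
import re

def within_n_words(text: str, first: str, second: str, n: int) -> bool:
    """True if the two phrases occur with at most n words between them, in either order."""
    words = re.findall(r"\w+", text.lower())
    a, b = first.lower().split(), second.lower().split()

    def starts(phrase):
        # Every position in the text where the phrase begins
        return [i for i in range(len(words) - len(phrase) + 1)
                if words[i:i + len(phrase)] == phrase]

    for i in starts(a):
        for j in starts(b):
            # Number of words between the end of the earlier phrase and the start of the later one
            gap = j - (i + len(a)) if i < j else i - (j + len(b))
            if 0 <= gap <= n:
                return True
    return False

sentence = "Library technology has changed how social media is used"
print(within_n_words(sentence, "library technology", "social", 9))  # True
print(within_n_words(sentence, "library technology", "social", 2))  # False
```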

Combining asterisks with site operator

I guess everyone knows about the useful site: function . But did you know it works with wildcards as spotted here?

There's a lot more detail here that I recommend you read for interaction between wildcards and site operators. Combine it with the minus (-) operator for more fun!


As you can see while Google does generally support Boolean searching loosely (though it often does unexpected things like drop terms and may or may not include common words searched), the exact details are very different!

If you want to know more about the nuts and bolts of boolean operators in Google, I highly recommend

Thursday, October 15, 2015

Of full text , thick metadata , and discovery search

As my institution recently switched to Primo, nowadays I lurk in the Primo mailing list. I am amused to note that in many ways the conversation on it is very similar to what I experienced when lurking in the Summon mailing list. (One wonders if in time to come this difference might become moot but I digress).

Why don't the number of results make sense?

A common thread that occurs on such mailing lists from time to time and that often draws tons of responses is a game I call "Do the number of results make sense?".

Typically this would begin with some librarian (or technical person tasked with supporting librarians) bemoaning the fact that they (or their librarians) find that the number of results shown is not "logical".

For example someone would post an email with a subject like "Results doesn't make sense". The email would look like this (examples are made up).

a) Happy birthday    4,894
b) Happy birth*    3,623                                      
c) Happy holidays  20,591
d) Happy holid*    8,455
e) Happy OR birthday 4,323                                    

The email would then point out that it made no sense that the number of results in b) and d) was lower than in a) and c) respectively, or that e) should have more results than a).
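The complaint boils down to a handful of "broader search should return at least as many results" checks. Here is a minimal Python sketch of those checks, using the made-up counts from the example email. The function name and the list of expectations are my own illustration of strict Boolean reasoning, not something any discovery service provides.

```python
def find_count_surprises(counts, expectations):
    """Return the (broader, narrower) pairs where the broader search got fewer results."""
    return [(broader, narrower)
            for broader, narrower in expectations
            if counts[broader] < counts[narrower]]

# Made-up result counts from the example email above
counts = {
    "Happy birthday": 4894,
    "Happy birth*": 3623,
    "Happy holidays": 20591,
    "Happy holid*": 8455,
    "Happy OR birthday": 4323,
}

# Under strict Boolean logic, truncation and OR can only widen a search
expectations = [
    ("Happy birth*", "Happy birthday"),
    ("Happy holid*", "Happy holidays"),
    ("Happy OR birthday", "Happy birthday"),
]

for broader, narrower in find_count_surprises(counts, expectations):
    print(f"{broader} ({counts[broader]}) returned fewer results than {narrower} ({counts[narrower]})")
```

All three expectations fail with these numbers, which is exactly what the email is complaining about.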

Other variants would include using quotes, or finding that after login (which usually produces more results due to results appearing from mutually licensed content providers) the number of results actually fell etc.

The "reason" that often emerges is that the web scale discovery service, whether Summon or Primo, is doing something "clever" that isn't transparent to the user, resulting in a search that doesn't follow strict boolean logic.

In the past, I've seen cases such as

* Summon doing stemming by default but dropping it when boolean operators were used (might have changed now)
* Primo doing metadata search only by default but expanding to matching full text if the number of results dropped below a certain number.
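The second case is worth a toy model, because it shows how a broader search can legitimately return fewer results. The Python sketch below is not Primo's actual algorithm: the threshold of 2, the substring matching and the four sample records are all assumptions made up for illustration.

```python
def search(records, term, min_results=2):
    """Match on metadata only, expanding to full text when there are too few hits."""
    term = term.lower()
    hits = [r for r in records if term in r["metadata"].lower()]
    expanded = False
    if len(hits) < min_results:
        # Too few metadata matches: fall back to matching full text as well
        hits = [r for r in records
                if term in r["metadata"].lower() or term in r["full_text"].lower()]
        expanded = True
    return hits, expanded

records = [
    {"metadata": "Happy birthday songs", "full_text": ""},
    {"metadata": "Party planning", "full_text": "a happy birthday party"},
    {"metadata": "Celebrations", "full_text": "happy birthday wishes"},
    {"metadata": "Happy birth stories", "full_text": ""},
]

# "happy birth" stands in for the truncated search happy birth*
for term in ("happy birthday", "happy birth"):
    hits, expanded = search(records, term)
    print(term, len(hits), "expanded to full text" if expanded else "metadata only")
```

Here the narrower search finds only one metadata match, falls below the threshold, gets expanded to full text and ends up with three results. The broader search finds two metadata matches, is never expanded, and stays at two.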

I've discussed in the past How is Google different from traditional Library OPACs & databases?  and in this way web scale discovery services are somewhat similar to Google in that they don't do strict boolean and can do various adjustments to try to "help the user" at the cost of predictability and often transparency if the user wasn't given warning.

Matching full text or not?

In the most recent case I encountered in the Primo mailing list, it was announced there would be an enhancement to add a displayed message indicating that the search was expanded to match full text.

This led to a discussion on why Primo couldn't simply match on full text all the time, or at least provide an option to do either, like EBSCO Discovery Service does.

MIT Libraries' Ebsco Discovery Service searches in full text by default but you can turn it off.

An argument often made is that matching on metadata only improves relevancy, in particular for known item searching, which generally makes up about 40-60% of searches.

For sure this makes relevancy ranking much easier since not bothering to consider matches in full text means the balancing act between ranking matches in full text vs metadata can be avoided.

In addition, unlike Google or Google Scholar, the discovery service index is extremely diverse, including some content that is available in metadata only formats while other content includes full text or consists of non text items (eg DVDs, videos).

Even if the items contain full text, they range in length from a single page or paragraph to thousands of pages (for a book).

Not needing to consider this difference makes relevancy ranking much easier.

Metadata thick vs thin

Still, a metadata match only approach ignores potentially useful information in the full text, and it's still not equally "fair", because content with "Thick metadata" still has an advantage over content with "Thin metadata".

I was not familiar with either term until Ebsco began to talk about it. See abstract below.

Of course "other discovery services" here refers mainly to Proquest's Summon (and Exlibris's Primo), which have roughly the same articles in the index but, because they obtain the metadata directly from the publisher, have limited metadata: basically article title, author, author supplied keywords etc.

Thick metadata, by contrast, would generally have controlled vocabulary, table of contents etc.

The 4 types of content in a discovery index

So when we think about it, we can classify content in a discovery service index along 2 dimensions

a) Full text vs Non-full text
b) Thick metadata vs Thin metadata

Some examples of the type of content in the 4 quadrants

A) Thick Metadata, No Full text - eg. Abstracting & Indexing (A&I) databases like Scopus, Web of Science, APA Psycinfo etc, MARC records

B) Thick Metadata, Full text - eg. Ebsco databases in Ebsco Discovery Service, combined super-records in Summon that include metadata from A&I databases like Scopus and full text from publishers

C) Thin metadata, No Full text - eg Publisher provided metadata with no full text, Online video collections, Institutional repository records?

D) Thin metadata, Full text - eg Many publisher provided content to Summon/Primo etc.
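For readers who think better in code, the two dimensions map onto a record with two flags. The Python sketch below is purely illustrative; the class and field names are made up and do not correspond to any discovery service's actual record format.

```python
from dataclasses import dataclass

@dataclass
class IndexRecord:
    source: str
    has_thick_metadata: bool  # e.g. controlled vocabulary, table of contents
    has_full_text: bool

def quadrant(record: IndexRecord) -> str:
    """Return the class letter (A to D) used in the list above."""
    if record.has_thick_metadata:
        return "B" if record.has_full_text else "A"
    return "D" if record.has_full_text else "C"

examples = [
    IndexRecord("A&I database record", has_thick_metadata=True, has_full_text=False),
    IndexRecord("Publisher record with full text", has_thick_metadata=False, has_full_text=True),
]
for record in examples:
    print(record.source, "->", quadrant(record))
```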

What are the different ways the discovery service could do ranking?

Type I - Use metadata only - Primo approach (does expand to full text match if number of results falls below a threshold)

Type II - Use metadata and full text - Summon approach

Type III - Use full text mostly plus limited metadata - Google Scholar approach?

Type IV - User selects either Type I or II as an option - Ebsco Discovery Service approach

The Primo approach of mainly using metadata (and matching full text only if the number of results is below a certain threshold), as I said, privileges content that has thick metadata (Class A and B) over thin metadata (Class C and D), but is neutral with regard to whether full text is provided.

Compare this with an approach like Summon's that uses both metadata and full text. Here full text becomes important: regardless of whether you have thin or thick metadata, it helps to have full text as well.

All things equal, would a record that has thick metadata but no full text (Class A) rank higher than one that has thin metadata but has full text (Class D)?

It's hard to say. Depending on how the algorithm weights full text vs metadata fields, I could see it going either way.
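A toy scoring function shows how the answer can flip with the weights. This Python sketch is not how Summon, Primo or any real service ranks results: the simple term counting, the two sample records and the weights are all assumptions chosen to illustrate the point.

```python
def score(record, terms, metadata_weight, full_text_weight):
    """Weighted count of query term occurrences in metadata and in full text."""
    meta_words = record["metadata"].lower().split()
    text_words = record["full_text"].lower().split()
    return sum(metadata_weight * meta_words.count(t) + full_text_weight * text_words.count(t)
               for t in terms)

def top_record(records, terms, metadata_weight, full_text_weight):
    """Name of the highest scoring record under the given weights."""
    best = max(records, key=lambda r: score(r, terms, metadata_weight, full_text_weight))
    return best["name"]

records = [
    # Class A: thick metadata, no full text
    {"name": "Class A", "metadata": "singapore housing policy public housing singapore",
     "full_text": ""},
    # Class D: thin metadata, full text that mentions Singapore a lot
    {"name": "Class D", "metadata": "singapore history",
     "full_text": " ".join(["singapore"] * 6 + ["housing"])},
]
terms = ["singapore", "housing"]

print(top_record(records, terms, metadata_weight=5, full_text_weight=1))  # Class A
print(top_record(records, terms, metadata_weight=1, full_text_weight=1))  # Class D
```

With metadata weighted heavily, the thick metadata record wins (20 vs 12). With equal weights, the sheer volume of full text matches wins instead (8 vs 4).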

My own past experience with Summon seems to show that there are times when full text matches dominate metadata. For example, searching for Singapore AND a topic can sometimes yield plenty of generic books on Singapore that barely mention the topic, ranked above more specific items. I always attributed it to the overwhelming match of the word "Singapore" in such items.

The fear that the mass of full text overrides metadata is the reason why some A&I providers are generally reluctant to have their content included in discovery services. This is worsened by the fact that currently there is no way to measure the additional benefit A&Is bring to the discovery experience, as their metadata once contributed will appear alongside other lower quality metadata in the discovery service results.

If by chance the library has access to full-text via Open URL resolution, users will just be sent to the full text provider while the metadata contributed by the A&I database that contributed to the discovery of the item in the first place is not recognised and the A&I is bypassed. This is one of the points acknowledged in the Open Discovery Initiative reports and may be addressed in the future.

In fact implementation of discovery services can indeed lead to a fall in usage of A&I databases in their native interfaces as most users no longer need to go directly to the native UI. Add the threat from Google Scholar, you can understand why A&I providers are so wary.

I would add that this fear that discovery services (except for Ebsco, which already hosts content from A&Is like APA's PsycINFO) will not properly rank metadata from A&Is is not a theoretical one.

Ebsco in the famous exchange between Orbis Cascade alliance and Exlibris,  claims that

As you are likely aware, leading subject indexes such as PsycINFO, CAB Abstracts, Inspec, Proquest indexes, RILM Abstracts of Music Literature, and the overwhelming majority of others, do not provide their metadata for inclusion in Primo Central. Similarly, though we offer most of these databases via EBSCOhost, we do not have the rights to provide their metadata to Ex Libris. Our understanding is that these providers are concerned that the relevancy ranking algorithm in Primo Central does not take advantage of the value added elements of their products and thus would result in lower usage of their databases and a diminished user experience for researchers. They are also concerned that, if end users are led to believe that their database is available via Primo Central, they won't search the database directly and thus the database use will diminish even further.

Interestingly, Ebsco discovery service itself splits the difference between Primo and Summon and allows librarians to set the default of whether to include matching in full text or metadata only but allows users to override the default.

From my understanding default metadata only search in EDS libraries is pretty popular because many librarians feel metadata only searching provides more relevant results.

I find this curious because EBSCO is on record for stating that their relevancy ranking places the highest priority on their subject headings rather than title, as they are justly proud of the subject headings they have.

One could speculate EBSCO of all discovery services would weigh metadata more than full text, but librarians still feel relevancy can be improved by ignoring full text!

Content Neutrality?

With the merger of Proquest and Exlibris , we are now down to one "content neutral" discovery service.

One of the fears I've often heard from librarians is that Ebscohost would "push up" their own content in their discovery service, and to some extent people fear the same might occur in Summon (and now Exlibris) for Proquest items.

Personally, I am skeptical of this view (though I wouldn't be surprised if I am wrong either). But I do note that for discovery vendors that are not content neutral, it's natural that their own content will have at the very least full text if not thick metadata, while content from other sources is likely to have poor quality metadata and possibly no full text unless efforts are taken to obtain them.

This itself would lead to their own content floating to the top even without any other evil doing.

To be frank, I don't see a way to "Equalize" everything , unless one ignores full text and also only ranks on a very limited set of thin metadata that every content has.

Ignoring metadata and going full text mostly?

Lastly, while there are discovery services that rank based on metadata but ignore full text, it's possible but strange to think of a type of search that is the exact opposite.

Basically such a system ranks only or mostly on full text and not on metadata (whether thick or thin)

The closest analogy I can think of for this is Google or Google Scholar.

All in all, Google Scholar I guess is a mix of mostly full text and thin metadata so this helps make relevancy ranking easier since we are ranking across similar types of content.

Somehow though Google Scholar still manages to do okay, though as I mentioned before in "5 things Google Scholar does better than your library discovery service", it has a big advantage: "Google Scholar serves one particular use case very well - the need to locate recent articles and to provide a comprehensive search", compared to the various roles library discovery services are expected to play, including known item search of non-article material.


Honestly, the idea that libraries would want to throw away readily available data such as full text to achieve better relevancy ranking is a very odd one to me.

That said we librarians also carefully curate the collections that are searchable in our discovery index rather than just adding everything available or free , so this idea of not using everything is not a new concept I guess.


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.