
Apache Solr Part-1  

Solr is an open source enterprise search platform, written in Java, from the Apache Lucene project. It helps us find the required information in a large data source like Hadoop. Its capabilities don't stop at searching: it can also be used for storage, like a NoSQL database, so it can be understood as a non-relational data storage and processing technology.

Solr is widely used for enterprise search and analytics use cases and has an active development community. It runs as a standalone search server and uses Lucene (a Java search library) at its core. It exposes REST-like HTTP APIs that speak XML and JSON, which makes it accessible from distributed applications written in various programming languages.

Felt stereotypical.. ha ha.. Let's make it simple..

Let's say you are preparing some research notes and want to refer to books from the university central library. You got the book you want, let's say "Advanced Algorithm Design", and want to read about the "Pattern Matching Algorithm". What is the first thing you do? You open the index of the book to find out which page your topic is on, and then open that page directly to read.

Solr does something similar.. it stores the information you give it under a reference called an index, and looks things up in that index when you search. So, Solr works on the concept of indexing.
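To make the book-index analogy concrete, here is a minimal sketch of the idea in Python. It is only an illustration of indexing in general, not how Solr or Lucene implement it; the function names and sample documents are made up for this example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, term):
    """Look the term up in the index instead of scanning every document."""
    return sorted(index.get(term.lower(), set()))

# Hypothetical sample documents, keyed by id.
docs = {
    1: "Pattern matching algorithm",
    2: "Sorting algorithm basics",
    3: "Graph traversal",
}
index = build_index(docs)
print(search(index, "algorithm"))  # [1, 2]
print(search(index, "Graph"))      # [3]
```

Just like the index at the back of a book, the lookup goes straight from the word to the places where it occurs.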

Features of Solr:

Let's take a look at what Apache Solr is capable of doing..
  • Document parsing (to store)
  • Full-text search using Lucene
  • Text highlighting
  • JSON, XML, PHP, Ruby, Python, XSLT, Velocity and custom Java binary output formats over HTTP
  • Distributed search through sharding
  • Geo-spatial search
  • Cluster-based search results and so on..
Moreover, it is flexible and designed to be open for extension, which lets developers customise or extend Solr's components.

Architecture:

This is how your search request is processed to give a result..
  • Request Handling: The request we send to Solr is taken care of by the request handlers. There are different request handlers to receive different types of requests. We need to choose the appropriate request handler by mapping it to the end-point URI (/select, /update) to get our request resolved.
  • Search: Solr is equipped with multiple search components, like the spell-checking component, highlighting component, query component etc., which perform the search with the help of search handlers.
  • Query Parsing: After a quick syntax check, Solr parses the queries we pass to it into the language the underlying Lucene understands.
  • Response Writing: Once the search is completed, the response writing components format the result into one of the supported formats, like XML, JSON, CSV, etc.
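The flow above can be pictured with a toy model in Python. This is only a sketch of the idea (pick a handler by end-point, parse the query, format the response); none of these names are Solr APIs, and the in-memory document list is an assumption for illustration.

```python
import json

# Toy stand-in for an index; these documents are hypothetical.
DOCS = [{"id": "1", "cat": "electronics"}, {"id": "2", "cat": "music"}]

def select_handler(params):
    """Query step: parse a 'field:value' query and return matching documents."""
    field, _, value = params["q"].partition(":")
    return [d for d in DOCS if d.get(field) == value]

def update_handler(params):
    """Index step: add the posted document to the store."""
    DOCS.append(params["doc"])
    return [params["doc"]]

def write_csv(docs):
    """Format documents as 'id,cat' lines."""
    return "\n".join(f"{d['id']},{d['cat']}" for d in docs)

# The end-point URI decides which handler runs; 'wt' decides the output format.
REQUEST_HANDLERS = {"/select": select_handler, "/update": update_handler}
RESPONSE_WRITERS = {"json": json.dumps, "csv": write_csv}

def handle(path, params):
    docs = REQUEST_HANDLERS[path](params)
    writer = RESPONSE_WRITERS[params.get("wt", "json")]
    return writer(docs)

print(handle("/select", {"q": "cat:electronics", "wt": "csv"}))  # 1,electronics
```

In real Solr these pieces are configured per core rather than hard-coded, but the division of labour is the same.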

Get Solr..

Now, let's get our hands dirty with Solr.. Download Solr from here..

 NOTE: As Apache Solr is an open-source technology, even the source code (solr-xxx-src.zip) is available to download. As we just want to use it, make sure you download solr-xxx.zip.

Let's get Solr up and running..

Once you download it, unzip it and copy the folder to some other location (better not to leave it in ../Downloads). Let's say it is now in D:/varun/tools/ApacheSolr
  1. Launch the command prompt (Terminal in case of Mac)
  2. Navigate to the folder
     cd D:/varun/tools/ApacheSolr/solr-7.3.1
  3. Now start the Solr server using the start command (on Windows, the script is bin\solr.cmd)
     bin/solr start
    After printing some verbose output, it finally says
     Waiting up to 180 seconds to see Solr running on port 8983
    Started Solr server on port 8983 (pid=8248). Happy searching!
    There you go.. You have successfully started the Solr server..
  4. Now navigate to localhost:8983/solr (Solr Admin) in the browser and you should see the Solr Admin dashboard

Now that you have started the Solr server, you need to know some basic Solr commands to operate it.
  • By default Solr uses port 8983. If you want to start the server on some other port, pass the port number as
    bin/solr start -p 4567
  • To stop the server, use (-p picks the port of the instance to stop; -all stops every running instance)
    bin/solr stop -p 4567
    resulting in
    Sending stop command to Solr running on port 4567 ... waiting 5 seconds to 
    allow Jetty process 6035 to stop gracefully.
    which means it is saying goodbye to you.
  • Sometimes you may want to restart the server. Then execute
    bin/solr restart
    and the server should restart

For any help, use
bin/solr -help
It shows the different commands you can use:
Usage: solr COMMAND OPTIONS where COMMAND is one of: start, stop, restart, status, healthcheck, create, create_core, create_collection, delete, version, zk, auth, assert, config
Standalone server example (start Solr running in the background on port 4567):
bin/solr start -p 4567
  
SolrCloud example (start Solr running in SolrCloud mode using localhost:2181 to connect to Zookeeper, with 1g max heap size and remote Java debug options enabled): 
bin/solr start -c -m 1g -z localhost:2181 -a "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1044"

Pass -help after any COMMAND to see command-specific usage information, such as: bin/solr start -help or bin/solr stop -help


If you observe, the Admin screen says 'No cores available'. So, what are cores? To understand that, let us get familiar with the terminology..
  • An instance is a Solr server instance (just like our Tomcat instance). We can see the instances in the Solr home directory. A Solr instance runs in a JVM and contains one or more cores.
  • A core is a unit within an instance in which you store the indexes. Instead of maintaining multiple instances of Solr servers, we'll have one or more cores in an instance.
  • A document is the basic unit of information in Solr: a set of data that describes something. Say a person is the document.. name, age, gender, height etc. are the set of data that describes the person.
  • Fields are the ones that describe the document. In the above example, name, age, gender and height are the fields.
  • A field type says what type of data the field holds. If you declare the field type properly, e.g. name is a string and age is an int, it helps Solr search and return the results easily.
When you add a document, Solr takes the information in the document's fields and adds that information to an index. When you perform a query, Solr can quickly consult the index and return the matching documents.
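As a rough picture of documents, fields and field types, here is a small Python sketch using the person example. The FIELD_TYPES mapping and the validate helper are hypothetical and only mimic the idea; in Solr, field types are declared in the schema, not in client code.

```python
# Hypothetical field types for the "person" document from the example above.
FIELD_TYPES = {"name": str, "age": int, "gender": str, "height": float}

def validate(doc):
    """Return the names of fields whose values do not match the declared type."""
    return [
        field
        for field, value in doc.items()
        if not isinstance(value, FIELD_TYPES.get(field, object))
    ]

# A document is just a set of fields describing one thing.
person = {"name": "Asha", "age": 30, "gender": "F", "height": 1.65}

print(validate(person))                              # []
print(validate({"name": "Ravi", "age": "thirty"}))   # ['age']
```

Knowing that age is a number rather than free text is what lets a search engine treat it correctly, for example when sorting or filtering by range.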


Let's feed Solr..

Along with data, Solr takes some configuration as well. Following are the configuration files.

  • solr.xml is the configuration file that carries information about the cores. Solr uses this file to load / identify the cores.

  • schema.xml is the file that holds the fields and field types, i.e. the complete schema. When Solr manages the schema for you, the file is named managed-schema or managed-schema.xml, depending on the Solr version.

  • solrconfig.xml carries core-specific configuration like request handlers (which process your requests to Solr), response formatting and a lot more.

  • data-config.xml is required when you are pulling data from databases. In this file you need to configure the dataSource (URL, username, password), the entities (tables) and the fields (columns).


Now let us give Solr the content.. We can call this process indexing, which makes the content searchable.


A Solr index can accept data from multiple different sources including XML, CSV, Word document, PDF and the data from databases as well.


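Now let's post the data from documents into Solr. You can find the sample data in the ../../example/exampledocs directory. The commands below assume a core named techproducts already exists; if it does not, create it first with bin/solr create -c techproducts.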

On Linux / Mac,

 bin/post -c techproducts example/exampledocs/*
* means all the files in the given directory

On Windows,

 java -jar -Dc=techproducts -Dauto example\exampledocs\post.jar example\exampledocs\*

You should see verbose output showing the files being POSTed one after another..

Let's Search now..

We can search via REST clients, Postman on Chrome, curl etc., and also with the native clients available for many programming languages.

For now we will explore the Solr Admin UI. If you look, the techproducts core has been loaded (if it is not loaded, click on the dropdown and choose it). Click on the "Execute Query" button and you'll get 10 results (the default) displayed in JSON format.


In this way, we can search for information by changing the query in the "q" field. We can search for a single term (inStock:true), a particular field (cat:electronics), or a combined query (+electronics -music), which means results with electronics and without music. Note that there is no space between the - and the term it excludes.

You can explore by changing the query.
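If you would rather try these queries outside the Admin UI, the sketch below builds the request URLs in Python. It assumes Solr is running on the default port 8983 with a core named techproducts; the select_url helper is made up for this example. The URLs are only printed here, so you can paste them into a browser or curl.

```python
from urllib.parse import urlencode

# Assumes the default local server and the techproducts core.
BASE = "http://localhost:8983/solr/techproducts/select"

def select_url(q, rows=10):
    """Build a /select URL; urlencode escapes characters such as '+' and ':'."""
    return f"{BASE}?{urlencode({'q': q, 'rows': rows})}"

for q in ["inStock:true", "cat:electronics", "+electronics -music"]:
    print(select_url(q))
```

Encoding matters for the combined query: a raw + in a URL is read as a space, so it has to be sent as %2B to reach Solr as a "must include" operator.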


That's all for now.. Hope this blog gave you at least a basic understanding of Apache Solr. In the upcoming blog, we'll see how to get data from a database and we'll explore the Client API.


Till then, Happy Searching !!

About author

VarunM

A Computer Science graduate, now a Software Developer working on the Java and J2EE platforms. I am passionate about Java and its related technologies and always curious about what's going on with them. Eager to learn something new through exploration and to get to the crux of it.

Comments (2)

  • by: Satish15, 17 Days ago

    Thank you so much Varun. This blog helps me a lot in understanding what is Solr and how to run the Solr ,when i was new to Solr.

  • by: Sriman, 25-9-2018 11:21:08 AM

    Super and a very nice blog to have a quick and good start on Solr, Thanks!
