When it comes to web scraping, everybody says Python is the best language for it. Since I have no experience with Python, I will not comment on that. However, web scraping in Java is not as difficult as people think. In fact, anyone already familiar with regular expressions will have little difficulty doing it in Java, since regular expression syntax is largely the same regardless of which programming language you choose. Also, if you already have experience with web scraping in other languages, you will soon be able to do it in Java too. All it takes is knowledge of basic syntax and the concept of loops. Beyond regular expressions, the tools that dominate web scraping in Java are the Jsoup and Jaunt libraries.
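To make the regular expression route concrete, here is a minimal sketch that uses only the standard java.util.regex package. The class name RegexLinkExtractor, the pattern, and the sample HTML are my own illustrative assumptions, and the pattern only handles quoted href attributes in simple markup, so treat it as a starting point and not a general HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLinkExtractor {
    // Matches href="..." or href='...' inside an <a> tag.
    // Group 1 is the opening quote, group 2 is the URL.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Collects every href value found in the given HTML, in document order.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {          // the "concept of loops" in action
            links.add(matcher.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup; a real page would be downloaded first.
        String html = "<p><a href=\"https://example.com/a\">A</a> and "
                + "<a class='x' href='/b'>B</a></p>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

The same find-in-a-loop idea carries over from other languages almost unchanged; the main Java-specific detail is the doubled backslashes inside string literals.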
Jsoup is a free and open source Java library that lets you scrape and parse HTML from websites, files or even strings. DOM traversal is extremely simple in Jsoup. Submitting form data is very easy for GET requests, but it can be a little tedious for POST requests, especially if there are a lot of data fields. Jsoup uses CSS selectors to select and extract data.
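As a rough illustration of CSS selectors in Jsoup, the sketch below parses an HTML string and pulls out links from one part of the page. It assumes the Jsoup jar is on the classpath; the class name JsoupSelectorDemo, the "posts" id, and the sample markup are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class JsoupSelectorDemo {
    // Returns "text -> href" for every link inside the element with id "posts".
    public static List<String> extractPostLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> results = new ArrayList<>();
        // "#posts a[href]" is a CSS selector: anchors with an href under #posts.
        for (Element link : doc.select("#posts a[href]")) {
            results.add(link.text() + " -> " + link.attr("href"));
        }
        return results;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup standing in for a downloaded page.
        String html = "<div id=\"posts\">"
                + "<a href=\"/first\">First post</a>"
                + "<a href=\"/second\">Second post</a>"
                + "</div>"
                + "<a href=\"/about\">About</a>";
        extractPostLinks(html).forEach(System.out::println);
    }
}
```

For a live site, Jsoup.connect(url).get() returns the same kind of Document, and form fields can be attached with repeated .data(name, value) calls before .get() or .post(), which is where the tedium with many POST fields comes from.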
Just like Jsoup, Jaunt is a Java library that allows you to scrape and parse HTML from websites, files and strings. It is also free, but not open source. It is free only in the sense that the license has to be renewed every month, which means you have to download a new version of Jaunt every month. You also need to replace the old jar files with the new ones in your previous projects for them to work again. If you do not want this, there is a paid version of Jaunt. Functionality-wise, Jaunt can do almost everything that Jsoup can and more. Jaunt can also parse JSON and XML, and it supports REST APIs. This is one of the major reasons people prefer Jaunt. For selecting and extracting data, Jaunt has its own syntax.
While Jaunt is more powerful than Jsoup, I prefer to stick with Jsoup. Jaunt's syntax is more readable than CSS selectors, but hey, we're programmers, we are used to reading code, and CSS selectors just look better to us. Having to download new jar files every month and replace them in every project you have ever done is just not feasible. Yes, Jsoup has no JSON parser (it does ship an XML parser mode, though), but we can always combine Jsoup with regular expressions when the need arises.
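For the JSON case, the regular expression half of that combination might look like the sketch below. It uses only the standard library; the class name JsonFieldRegex and the sample payload are assumptions for illustration, and the pattern only copes with flat string values that contain no escaped quotes, so it suits small, predictable snippets and not arbitrary JSON.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonFieldRegex {
    // Looks for "key": "value" and returns the value.
    // Limitation: only string values without backslashes or escaped quotes.
    public static Optional<String> stringField(String json, String key) {
        Pattern pattern = Pattern.compile(
                "\"" + Pattern.quote(key) + "\"\\s*:\\s*\"([^\"\\\\]*)\"");
        Matcher matcher = pattern.matcher(json);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical JSON text, e.g. taken from a page that Jsoup downloaded.
        String json = "{\"title\": \"Web scraping in Java\", \"views\": 42}";
        System.out.println(stringField(json, "title").orElse("(not found)"));
        System.out.println(stringField(json, "views").orElse("(not found)"));
    }
}
```

The second lookup comes back empty because the value is a number and not a quoted string, which shows how quickly this approach hits its limits; anything nested or escaped is better handled by a dedicated JSON library.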
These are my thoughts on web scraping in Java. If you would like to share your opinion, feel free to leave a comment.