For many years, manual data entry in Excel (sourcing from books, as seen in this video) or manual copy-pasting from websites was my only way of creating databases. It was a slow process that limited the size of the databases I could make. Even so, I made about 40 databases in fields of personal interest (automobiles, geography, real estate, computers, gaming, etc.) purely as a hobby.
Since 2015 I have offered web scraping services… in ANY field. Scraping usually means writing a script that visits a list of given pages, copies specific data from each page and puts it in a database automatically, allowing me to create large databases with minimum effort, in a matter of hours.
You just need to provide the URLs of the pages you want to extract data from, and tell me which pieces of data to extract.
Theoretically I can scrape data from any website, but only websites that present the required data in a consistent structure from page to page can produce a good, usable database. After automatic scraping, some amount of manual work is still needed to make the database usable.
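To make the idea concrete, here is a minimal Python sketch of a scraper: visit a list of pages, pull named fields out of each one, and write the rows to CSV. It is an illustration only, not the software used for the projects on this page. The field names, the regular expressions in RULES and the sample pages are all hypothetical, and the pages are kept in memory so the sketch runs without network access.

```python
import csv
import io
import re

# Hypothetical extraction rules: column name -> regex with one capture group.
RULES = {
    "model": r'<h1 class="model">(.*?)</h1>',
    "power_hp": r'<span class="power">(\d+) hp</span>',
}

def extract_row(html, rules):
    """Return one record; a field stays empty when its pattern is not found."""
    row = {}
    for column, pattern in rules.items():
        match = re.search(pattern, html, re.S)
        row[column] = match.group(1).strip() if match else ""
    return row

def scrape(urls, fetch, rules):
    """Visit every URL using the supplied fetch(url) -> html callable."""
    return [{"url": url, **extract_row(fetch(url), rules)} for url in urls]

def to_csv(rows, columns):
    """Serialize the records as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up pages standing in for a real website.
SAMPLE_PAGES = {
    "https://example.com/cars/1":
        '<h1 class="model">Alpha 1.6</h1><span class="power">110 hp</span>',
    "https://example.com/cars/2":
        '<h1 class="model">Beta 2.0</h1>',  # inconsistent page: no power field
}

rows = scrape(list(SAMPLE_PAGES), SAMPLE_PAGES.get, RULES)
csv_text = to_csv(rows, ["url", *RULES])
print(csv_text)
```

The second sample page has no power field, so that cell comes out empty. This is the kind of gap that has to be fixed by hand when a website is not consistent from page to page.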
Note: From 2015 to 2017 I created over 50 custom databases via web scraping, and if all customers came back to ask for updates, I would be overloaded. Scraping more websites, if they take too much time, creates additional workload and delays everyone’s updates. In 2018 I decided to stop updating databases that have fewer than 5 sales per year so I can focus on the ~20 best-selling databases that produce 80% of my income.
So… unless you come with a GREAT IDEA for a database that can be sold to multiple customers, I reserve the right NOT to do your scraping project if you are the only customer and it takes more than ~2 hours of manual work and more than ~50 hours of running the scraper in the background.
Simple data scraping service
This applies to websites where each item has its own URL and all data is in the page's HTML code.
There are a few tools available online, usually free to download but limited in functionality: one project at a time, a limited number of pages you can extract, a limited number of pages per second, unless you upgrade to a paid subscription, which is ridiculously expensive for a one-time project (example: import.io). Although you can scrape a small number of pages yourself for free, it may take a few days to learn to use these tools efficiently. Most people do not have time to learn or cannot pay a high monthly subscription. I can help you!
A programmer friend made, in Visual Studio, a universal scraping program comparable with the tools available online, but with no limit on the number of pages or simultaneous projects. This allows me to scrape simple websites quickly and at a lower price than doing it yourself would cost you, spending at most 30 minutes of manual work writing extraction codes.
Price: sum of the following 4 things:
Number of pages to be extracted: up to 1,000 pages = $50, up to 10,000 pages = $100, up to 100,000 pages = $300.
Number of columns (pieces of data to extract from each page): 50 cents for each column.
Complexity: $0 fee for websites where all items are accessible from an index page; extra fee if items are displayed with infinite scrolling or pagination, if data must be entered in search boxes, etc.
Work after scraping: certain websites do not provide data in the format you need, so I charge an extra fee to manually arrange the data in Excel.
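As a worked example of how the four parts add up, here is a small Python sketch. The page tiers and the 50 cents per column come from the list above. The complexity and post-scraping fees have no fixed amounts here, so they are plain parameters that default to $0, and the function name is made up for illustration. No price is listed above 100,000 pages, so the sketch refuses to guess one.

```python
# Page tiers from the price list: (maximum pages, price in USD).
PAGE_TIERS = [(1_000, 50), (10_000, 100), (100_000, 300)]
PRICE_PER_COLUMN = 0.50  # USD per extracted column

def simple_scraping_price(pages, columns, complexity_fee=0.0, cleanup_fee=0.0):
    """Sum the four price components for a simple scraping job.

    complexity_fee and cleanup_fee are quoted per project, so the
    amounts passed in here are assumptions, not fixed prices.
    """
    for max_pages, tier_price in PAGE_TIERS:
        if pages <= max_pages:
            return (tier_price + columns * PRICE_PER_COLUMN
                    + complexity_fee + cleanup_fee)
    raise ValueError("no listed price above 100,000 pages")

# 5,000 pages with 20 columns, no extra fees: $100 + 20 x $0.50
print(simple_scraping_price(5_000, 20))  # 110.0
```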
Complex data scraping service
This applies to websites that have drop-down lists, search boxes or JSON, where data is behind a login, or where other actions are needed to display the data you want to extract. In this case online scraping tools do not work and my friend's universal scraping software does not work either, so he needs to make a custom scraper in Visual Studio just for that particular website. This may take a few days, depending on his available time.
Price: usually within the $200 to $500 range, which I share with my partner. The price depends more on the complexity of the website than on the number of pages to be extracted.
For fewer than 200 records it may be faster to copy-paste manually than to code a custom scraper.
Complex scraping services sometimes require screenshots (as below) to explain exactly where to click and what data to extract.
What cannot be scraped
I know how useful a phone or email database is, for example if you are a car insurance company wanting to spam car owners who post listings on classifieds websites. But most classifieds websites protect the seller's phone number and contact email from unsolicited messages by using a Contact button, by requiring a click on a button to reveal the email, or by showing the email as an image rather than as text. In this case the scraper bot cannot pick up the data, so you need to hire someone to do manual data entry, which will require a large amount of time.
Anti-scraping features: some websites look simple to scrape, but after starting the job I get an IP block, a CAPTCHA or other measures made either to prevent someone copying data or to prevent DDoS attacks. If you ask for a price before I start the job, you should be prepared for price changes if I find anti-scraping features and need to spend extra time looking for alternate ways to scrape.
Some websites (1%?) have strong anti-scraping features that cannot be overcome. Do not get angry at me if I fail to scrape data from one website; just give me another website and I will try again.
Advantages of my service
The main advantage of working with me is that once I create a database I can post it on my website to be purchased by multiple people, so you will pay just a small part of the cost of scraping (if the database is something of personal interest to me: cars worldwide, real estate in Singapore, and a few more).
If you want to keep it private, I can sell it just to you at a higher price and not publish it on the website, but the BIG question is what I should do if a second customer asks me to scrape the same website and agrees to have it published on the website to get a cheaper price. I reserve the right to sell to other people if they ask. If you ask me to scrape a website outside my personal interests and unrelated to the fields covered by my website, I will not publish it because it is unlikely that anyone else will purchase it, and you need to pay the full cost of scraping.
The databases published on the website include FREE updates for one year, with higher update frequency for products with higher sales volume. But if you ask me to scrape a website privately, “just for you”, you need to pay for each update, with the price depending on how much time each re-scraping takes.
The scraper takes between 0.2 and 2 seconds to scrape each page, depending on the website. So I may not be able to do very large databases, for example if you want to scrape 1 million records with monthly updates. The limit on how many records I can scrape depends on how many customers I have in the current month, although I am able to run multiple projects at the same time.
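To see why 1 million records with monthly updates is a problem, the arithmetic can be written as a few lines of Python. This is a back-of-the-envelope sketch that assumes a single scraper running non-stop, with no blocked pages and no other projects sharing the machine; the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86_400

def scraping_days(pages, seconds_per_page):
    """Days of continuous running needed to scrape the given number of pages."""
    return pages * seconds_per_page / SECONDS_PER_DAY

fastest = scraping_days(1_000_000, 0.2)  # best case: 0.2 seconds per page
slowest = scraping_days(1_000_000, 2.0)  # worst case: 2 seconds per page
print(f"{fastest:.1f} to {slowest:.1f} days")  # 2.3 to 23.1 days
```

At the slow end a single pass takes about 23 days, which leaves almost no room in a monthly update cycle for retries or for other customers' projects.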
Is data scraping legal or not?
Usually scraping is legal, but using scraped data on a public website may be illegal.
It depends… If the data is added by volunteers, or by sellers on classifieds websites, scraping is most likely legal. But if the authors of the website worked hard to compile data from sources like car brochures or manufacturer websites, scraping is most likely illegal, especially if you use their data to make your own website or for another commercial purpose. Although the data is freely available, a compilation can be copyrighted. Most websites contain dummy data (example: a bunch of cars with 1 horsepower more or less than the official value), and if you use data copied from them, they can prove that you copied their data compilation and file a lawsuit against you. BEWARE!
For a moment I was concerned that my European Car Models & Engines Database, sourced from AutoKatalog books, might be a copyright violation, but I came to the conclusion that it is fine, because my database is an original compilation that arranges the data in a different structure than the book, and it targets an online audience, while AutoKatalog is a book sold in shops, targeting car hobbyists. I make over 100 sales each year without a single person worrying about copyright.
In the case of America, Year-Make-Model is my original compilation sourced from Wikipedia and 3 more websites, while Year-Make-Model-Trim-Specs is scraped from the Edmunds.com website, which also offers an API and thus allows other websites to use its data, so again it is legal.
But since I created the India car database in 2015, sourcing data from Carwale.com, I have been concerned that what I am doing may be illegal.
Country matters: I have had many customers in India asking me to scrape data from various websites. However, when someone from Europe or America asks me for certain data that I do not have and I propose scraping it from a website, most people bring up the legal issues of web scraping.
Funny case: someone offered to sell me a car database, with rights to resell it on my website, that he claimed to have created by working for 4 months, 8 hours per day, copy-pasting data from a website. From a copyright point of view it does NOT matter whether you extracted data using automatic software or typed every letter manually: as long as you copied the data from a website, your work is not original. He was probably not aware of scraping software. If you wasted a few months doing something that could have been done in a few hours using scraping software, you are an IDIOT (I was an idiot too, doing such jobs before 2015 when I was not aware of scraping software, but small jobs only). I still do manual data entry for the European database because I source data from books (offline sources), making an original product on the web.
Examples of data extraction / scraping projects done and their prices
All scraping software saves data in CSV format, but when it comes to publishing on the website, I save it as XLS and add borders, colors, headers and other visual features to match the style of other products “Made by Teoalida”.
India Car Database – source: www.carwale.com – Made in August 2015 out of personal interest because numerous people were asking me about an Indian car database. Being my first scraping project, it initially took about 7 days to figure out how to do it; later I did it again in just 1 day. Price: 30-120 euro.
India Bike Database – source: www.bikewale.com – Made in January 2016 after a 2nd person requested a database of bikes sold in India. One of the easiest projects, having no drop-down boxes but plain links to each bike page. 250 records, price: 25 euro.
Skyscrapers Buildings Database – source: www.emporis.com – Made in November 2015 out of personal interest and put up for sale for $150 (15000 buildings), it turned into a marketing failure: 1 year passed and nobody purchased it (except a customer who asked me to make the US buildings database, see below). It took about 20 hours to manually compile a list of cities with buildings over 100 meters, then a list of buildings from these cities, and then I used software to automatically extract data about each building. 15000+ buildings. Emporis blocks my IP for 2 days if I access more than 3000 pages in one day, so data extraction with import.io (not able to change IP) was limited to 3000 buildings per day, which took about 1 hour daily for 6 days.
US Buildings Database – source: www.emporis.com – Made in November 2016 for a customer who saw the Skyscrapers database above and asked me to make a similar database with all types of buildings in the USA. 160,000+ buildings. I had to run my friend's scraper in over 100 batches of max 2000 buildings; because I was scraping locally I could change IP after each batch, re-running the blocked URLs again and again until I was able to get all buildings. 60 hours of work. Price: $600.
Singapore Condo Database II – source: www.propertyguru.com.sg – Made for a customer in 2016. Apparently an easy project, having plain links to all condos, it turned difficult because of a fucking CAPTCHA page appearing after every 10 pages extracted. My programmer partner spent 2 weekends in Visual Studio making a custom app that allows me to input the CAPTCHA when needed, and charged me $300 USD. I sold the database with 3176 condos for $317.60 SGD (about 240 USD), leaving me at a loss until I sold it to a second customer.
World countries database – source: The World Factbook – Made in 2017 out of personal interest, a database with an impressive 362 columns and only 268 rows. It took about 5 hours to write the extraction codes for each column, and only 35 minutes to scrape the data.
Sulekha.xls – source: www.sulekha.com – A somewhat unusual scraping job: a one-time-use database for SMS and email marketing, instead of a saleable product containing all car models, all buildings, all of something.
Postal code scraping – a customer gave me a list of postal codes, which I entered in www.streetdirectory.com to get the building name and street address (in Singapore every building has a unique postal code).
Flickr scraping – a customer downloaded a large number of car images from Flickr and realized that to use them on his website he needs to specify the author name, a link to the source page and a link to the Creative Commons license. I scraped this info: 223,000 images for 223 euro, at 0.6 seconds per page.
Used cars images – a customer asked me to scrape a used cars website to get the image URL besides Make, Model, Year. It took only a FEW HOURS and I got over 100,000 car images, all in the same resolution. He told me to keep it private and not publish or resell it on the website. So I am telling you only the idea. If anyone wants to scrape car images in this way, let me know what website to scrape!
I have done a few more databases, but the customers told me NOT to publish them on the website, or they are in fields unrelated to the topics covered by my website, so even if published, they would not get sales.
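The batch-and-retry routine described for the US Buildings Database (batches of max 2000 pages, then re-running the blocked URLs) can be sketched in Python roughly as follows. This is not the actual Visual Studio scraper. The `fetch` callable, the convention that it returns None for a blocked page, and the `between_batches` hook (the place where the IP would be changed by hand) are assumptions made for illustration.

```python
def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def scrape_in_batches(urls, fetch, batch_size=2000, max_rounds=10,
                      between_batches=None):
    """Scrape in batches and re-run blocked URLs in later rounds.

    fetch(url) is assumed to return the page data, or None when the
    request was blocked. Returns (results, urls_still_blocked).
    """
    results = {}
    pending = list(urls)
    for _ in range(max_rounds):
        if not pending:
            break
        blocked = []
        for batch in batches(pending, batch_size):
            for url in batch:
                page = fetch(url)
                if page is None:
                    blocked.append(url)  # keep it for the next round
                else:
                    results[url] = page
            if between_batches is not None:
                between_batches()  # e.g. pause here and change IP
        pending = blocked
    return results, pending
```

Calling `scrape_in_batches(urls, fetch)` returns the pages collected so far plus the list of URLs that were still blocked after the last round, so an unfinished job can be resumed later with only the leftovers.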