Behind the Scenes: How We Scraped 250 Thousand Election Candidate Data for 2024

M. Zakyuddin Munziri

@zakiego

Originally written in Bahasa Indonesia.

Background

This project started from a tweet by Mas Gilang (@mgilangjanuar) on December 15, 2023. However, the data collection project was only completed on January 23, 2024. Many obstacles were encountered, which is why I want to write down the lessons learned from this project.

As an important note, this project is just a side project. It has no particular purpose other than the enjoyment of data scraping. Not a single rupiah was earned from this data collection.

Method

The method used is a simple fetch available in JavaScript.

const response = await fetch("http://example.com/movies.json");
const movies = await response.json();
console.log(movies);

Fetching was done through publicly accessible endpoints on the KPU website. Responses in JSON and HTML were collected and parsed to obtain clean data. That's it.

In more detail, there were three stages:

  1. getListDapil: first, collect the list of electoral districts (dapil)
  2. getListCalon: then fetch each electoral district to get the list of candidates in that area
  3. getProfilCalon: finally, with the available names, perform a POST fetch to the endpoint to get candidate profiles
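The three stages above can be sketched as one pipeline. The types and function signatures below are assumptions for illustration only; the real KPU responses have different shapes:

```typescript
// Illustrative orchestration of the three stages. The data shapes are
// hypothetical; the stage functions are injected so the sketch stays generic.
type Dapil = { kode: string };
type Calon = { nama: string; kodeDapil: string };
type Profil = { nama: string; riwayat: string };

async function scrape(
  getListDapil: () => Promise<Dapil[]>,
  getListCalon: (dapil: Dapil) => Promise<Calon[]>,
  getProfilCalon: (calon: Calon) => Promise<Profil>,
): Promise<Profil[]> {
  const profiles: Profil[] = [];
  // Stage 1: all electoral districts
  for (const dapil of await getListDapil()) {
    // Stage 2: candidates per district
    for (const calon of await getListCalon(dapil)) {
      // Stage 3: one profile request per candidate
      profiles.push(await getProfilCalon(calon));
    }
  }
  return profiles;
}
```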

Problems

During scraping, many problems were encountered, ranging from issues on the scraper side to issues on the server side.

Large Data

const list = [];
const resp = await fetch("https://example.com/api").then((res) => res.json());
list.push(resp);
await Bun.write("data.json", JSON.stringify(list));

The collected data was stored in an array. Then the array was saved to a JSON file.

As a note, I use Bun as the JavaScript runtime (rather than Node.js with npm, pnpm, or Yarn) because it's easier to use; the Bun.write call above is part of its API.

If the data were small, this wouldn't be a problem. However, in total, there were more than 250 thousand candidates. For a JSON file, that number would make the file large, so it would take a long time just to read and update the data inside.

From there, I switched to SQLite, using Drizzle as the ORM. Compared to JSON, SQLite as a real database makes select, insert, and update processes very easy.

Marking

Unfortunately, because there was so much data, not all fetch operations to get profiles were successful. The server had limitations in responding to requests.

There were 250 thousand candidates in this election. This means I needed to perform 250 thousand getProfilCalon fetches to get each candidate's profile. A very substantial number.

In this situation, it was impossible to do it with just one script run. Why? Because in the middle of the process, there would definitely be failed fetches, whether due to the server being overloaded with requests, or my network going down.

Therefore, I needed a way to tolerate failures: when the script is rerun, it should only perform getProfilCalon for candidates whose profiles haven't been fetched yet.

How do you mark whether a candidate's profile has been fetched or not? This is where SQLite makes things easy.

When running the getListCalon function, I get a list of candidate names. For each name, I added an is_fetched column. If getProfilCalon succeeds, it will be set to true.

| nomor | nama          | is_fetched |
| ----- | ------------- | ---------- |
| 1     | John Doe      | true       |
| 2     | Jane Smith    | true       |
| 3     | Bob Johnson   | true       |
| 4     | Alice Brown   | true       |
| 5     | Charlie Davis | false      |
| 6     | Eve Wilson    | false      |

As a result, I could perform queries like this:

// illustrative; the actual query goes through Drizzle's sql template
const listNotFetched = await db.all(
  sql`SELECT * FROM list_anggota WHERE is_fetched = false`,
);

From listNotFetched, we then run getProfilCalon to get detailed profiles from candidates.

Since we only query is_fetched = false, there's no need to fetch from the beginning; we only fetch profiles that haven't been obtained.
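The resume logic can be sketched in plain TypeScript, with an in-memory array standing in for the SQLite table (the row shape mirrors the table above; fetchProfil is a stand-in for the real getProfilCalon request):

```typescript
// Row shape mirroring the list_anggota table in the article.
type Row = { nomor: number; nama: string; is_fetched: boolean };

// fetchProfil resolves to true on success, false on failure.
async function resumeFetch(
  rows: Row[],
  fetchProfil: (row: Row) => Promise<boolean>,
): Promise<void> {
  // Equivalent of: SELECT * FROM list_anggota WHERE is_fetched = false
  const pending = rows.filter((r) => !r.is_fetched);
  for (const row of pending) {
    if (await fetchProfil(row)) {
      // Equivalent of: UPDATE list_anggota SET is_fetched = true WHERE ...
      row.is_fetched = true;
    }
  }
}
```

Failed rows stay at is_fetched = false, so the next run picks them up again automatically.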

Command Separation

Still related to failure tolerance: when rerunning the script, I shouldn't have to redo everything from the start. For context, there are 4 categories of data collected:

  1. DPD
  2. DPR RI
  3. DPRD Provinsi
  4. DPRD Kabupaten Kota

To tolerate failures and not have to run everything from the start, I separated the commands to run the script.

const category = process.argv[2];
const command = process.argv[3];

switch (category) {
  case 'dpr':
    switch (command) {
      case 'get-list-dapil':
        dpr.getListDapil()
        break;
      case 'get-list-calon':
        dpr.getListCalon()
        break;
      default:
        console.log('dpr command not found');
    }
    break;
  // ... other categories (dpd, dprd-provinsi, dprd-kabupaten-kota) handled the same way
  default:
    console.log('Unknown command');
}

With the code above in index.ts, commands would differ according to needs.

For example, if I need to get the list of electoral districts at the DPRD Provinsi level, the command is 'bun run index.ts dprd-provinsi get-list-dapil'

Similarly, if you want to get the list of DPR RI candidates, 'bun run index.ts dpr-ri get-list-calon'

Batch Fetching

Remember again, there are 250 thousand fetches that need to be done. There are several options for this:

  1. Default. Each fetch is awaited before the next one starts, so only one fetch runs at a time, which would be very slow. Assuming one fetch takes 2 seconds, 250 thousand fetches take 500 thousand seconds, or about 5.79 days. And that only covers fetching; after data collection there is still a processing phase that takes time as well.
  2. Parallel. This means fetches are done simultaneously. Imagine 250 thousand fetches being fired at one server simultaneously, what would happen? Yes, it would crash.
  3. A middle ground between fetching one by one and fetching all at once, which is batch fetching, or batching. This means we don't fire 250 thousand fetches at once, but divide them into groups. For example, one group consists of 100 fetches. So 100 fetches are fired, we wait for this group to finish, then we continue to another group and fire 100 fetches. This way, we're more server-friendly.
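Before reaching for a library, option 3 can be sketched by hand: slice the list into groups and await each group with Promise.all. A minimal sketch; a production version would also want per-batch error handling:

```typescript
// Fire requests in groups of `batchSize`, waiting for each group
// to finish before starting the next one.
async function fetchInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // All requests within one batch run concurrently...
    results.push(...(await Promise.all(batch.map(worker))));
    // ...and the next batch starts only after this one settles.
  }
  return results;
}
```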

Method 3 is what I used. Fortunately, Mas @gadingnstn had created a library called Concurrent Manager, which handles batching of concurrent promises. As a result, I didn't need to write the batching logic from scratch.

import ConcurrentManager from "concurrent-manager";

const concurrent = new ConcurrentManager({
  concurrent: 10, // at most 10 promises run at the same time
  withMillis: true,
});

// Queue one request per candidate, then run them in batches
for (const calon of list) {
  concurrent.queue(() => {
    return doSomethingPromiseRequest(calon);
  });
}

concurrent.run();

Anti-Bot

After problems on our side were resolved, another problem appeared on the server side: when too many fetches were performed at once, the server would automatically block the IP for a certain period.

To overcome this, I did two things:

1. Delays Between Fetches

Initially, 1000 fetches were fired at once. The response became 403, meaning the IP was blocked by the server. Then fetches were reduced to 50 and delays of several seconds were added between fetches. This solution proved effective in avoiding server blocking.

export const sleep = (ms: number) =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Sleep for a random duration between min and max milliseconds
export const sleepRandom = async (min: number, max: number) => {
  const delay = Math.floor(Math.random() * (max - min)) + min;
  await sleep(delay);
};

// Wrap fetch with a random 1.5-3 second delay before each request
export const customFetch = async (url: string, options?: RequestInit) => {
  await sleepRandom(1_500, 3_000);
  const res = await fetch(url, { ...options });
  return res;
};

2. Transit Via Cloudflare Worker

Although blocking could be overcome by reducing the concurrent count to 50 and adding delays between fetches, a new problem emerged: it took longer.

Various methods were tried and failed, until I finally realized that because this is IP-based rate limiting, I had to fetch through different IPs.

How? There are many ways. But the easiest (and free) way is to use Cloudflare Worker. I created an API to fetch to the KPU server.

The illustration is like this:

  • Before: User -> KPU Server
  • After: User -> Cloudflare Worker -> KPU Server

Simply put, Cloudflare Worker just fetches to the address I've specified, then returns the exact response it receives.

export default {
  async fetch(request, env, ctx) {
    console.log("Incoming Request:", request);
    // Forward the incoming request (method, headers, body) to the KPU endpoint
    const targetURL = "https://infopemilu.kpu.go.id/Pemilu/Dct_dprprov/profile";
    const response = await fetch(targetURL, request);
    console.log("Response from Target:", response);
    // Return the upstream response to the caller unchanged
    return response;
  },
};

Where I originally fetched https://infopemilu.kpu.go.id/Pemilu/Dct_dprprov/profile directly, I now only needed to fetch transit-example.workers.dev instead.

Because the request is made by the Cloudflare server, meaning the request uses Cloudflare's IP, my IP won't be blocked.

To speed up the scraping process, I created 3 Cloudflare Workers. The final scraping scheme became like this:

  1. User -> KPU Server
  2. User -> Cloudflare Worker 1 -> KPU Server
  3. User -> Cloudflare Worker 2 -> KPU Server
  4. User -> Cloudflare Worker 3 -> KPU Server
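One simple way to spread load across these four routes is to pick the endpoint round-robin before each fetch. A minimal sketch with placeholder worker URLs (the article doesn't specify how requests were actually distributed):

```typescript
// The four routes from the scheme above; worker URLs are placeholders.
const endpoints = [
  "https://infopemilu.kpu.go.id/Pemilu/Dct_dprprov/profile", // direct
  "https://transit-1.example.workers.dev", // Cloudflare Worker 1
  "https://transit-2.example.workers.dev", // Cloudflare Worker 2
  "https://transit-3.example.workers.dev", // Cloudflare Worker 3
];

let next = 0;
// Cycle through the endpoints so each source IP carries 1/4 of the load
function pickEndpoint(): string {
  const url = endpoints[next];
  next = (next + 1) % endpoints.length;
  return url;
}
```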

Closing

There are a thousand ways to impose restrictions, and a thousand and one ways to bypass them.

Finally, choose leaders who are willing to listen when criticized, not ones who silence...
