Press "Enter" to skip to content

A Hacky Way to Scrape LinkedIn

Recently, I was tasked with building a Chrome extension that scrapes LinkedIn profiles to gather important information such as name, bio, location, follower count, and connection count from a list of predetermined URLs. Once collected, this data needed to be saved to a database using a POST request, with Sequelize as the ORM. This is fairly simple and straightforward to do:

  1. Set up a SQL database and create a User model using Sequelize.
  2. Create an Express server.
  3. Develop the Chrome Extension.

I was expecting this to be very simple and thought it wouldn't take more than 30-40 minutes. The backend was probably the easiest part; I developed it the way you would for any other web dev project. Developing the Chrome extension is what took the longest. I'll walk you through my thought process and the mistakes I made while developing this. In this article I will focus on the scraping part and the Chrome extension rather than on creating the REST API with Node and Express. You can read more about how to create REST APIs using Node.js and Express here.

Project Structure

.
├── Backend
|   ├── Models
|   |   └── User.js
|   ├── node_modules
|   ├── .env
|   ├── .gitignore
|   ├── db.js
|   ├── server.js
|   ├── package.json
|   └── package-lock.json
└── Extension
    ├── manifest.json
    ├── index.html
    ├── icon.png
    ├── style.css
    ├── script.js
    ├── content.js
    └── background.js

Models/User.js

import { DataTypes } from "sequelize";
import sequelize from "../db.js";

const User = sequelize.define('User', {
    id: {
        type: DataTypes.INTEGER,
        primaryKey: true,
        autoIncrement: true,
    },
    name: {
        type: DataTypes.STRING,
        allowNull: false,
    },
    url: {
        type: DataTypes.STRING,
        allowNull: false,
    },
    about: {
        type: DataTypes.STRING,
        allowNull: true,
    },
    bio: {
        type: DataTypes.STRING,
        allowNull: true,
    },
    location: {
        type: DataTypes.STRING,
        allowNull: true,
    },
    followerCount: {
        // INTEGER rather than NUMBER: MySQL has no NUMBER type.
        type: DataTypes.INTEGER,
        allowNull: false,
    },
    connectionCount: {
        type: DataTypes.INTEGER,
        allowNull: true,
    },
}, {
    tableName: 'users',
    timestamps: false,
});

export default User;
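One thing the model alone doesn't do is create the users table. If you don't want to create it by hand, one option for a quick prototype like this (migrations are the better tool for real projects) is to let Sequelize create it at startup, e.g. somewhere in server.js after the imports:

// Create the users table from the model definition if it doesn't exist yet.
await User.sync();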

./db.js

import dotenv from 'dotenv';
import { Sequelize } from "sequelize";

dotenv.config();

const sequelize = new Sequelize(process.env.DB, process.env.USER, process.env.PASSWORD, {
  host: 'localhost',
  dialect: 'mysql',
})

async function testConnection() {
    try {
      await sequelize.authenticate();
      console.log('Connection has been established successfully.');
    } catch (error) {
      console.error('Unable to connect to the database:', error);
    }
}

testConnection();

export default sequelize;
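For reference, db.js expects the database name, user, and password to come from a .env file via dotenv. A minimal sketch with placeholder values (PORT is what server.js will listen on):

DB=linkedin_scraper
USER=root
PASSWORD=secret
PORT=5555

One caveat: on Unix-like systems the shell usually already defines USER as your OS username, and dotenv does not override existing environment variables by default, so a name like DB_USER is safer if you run into authentication errors.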

./server.js

import cors from 'cors';
import dotenv from 'dotenv';
import express from 'express';
import User from './Models/User.js';

dotenv.config();
const app = express();
app.use(express.json());
app.use(cors());

const port = process.env.PORT || 3000;

app.post('/getinfo', async (req, res) => {
    try {
        const { name, url, about, bio, location, followerCount, connectionCount } = req.body;
        const newUser = await User.create({
            name: name, url: url, about: about, bio: bio, location: location, followerCount: followerCount, connectionCount:connectionCount,
        })
        res.status(200).send(newUser);
    } catch (err) {
        console.log(err.message);
        // Without a response here, the request would hang on errors.
        res.status(500).send({ error: err.message });
    }
})

app.get('/ping', (req, res) => {
    res.send('pong');
})

app.listen(port, () => {
    console.log(`Server is running on port ${port}`);
});
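Before touching the extension, it's worth sanity-checking the endpoint. Here's a minimal smoke test using the fetch built into Node 18+ (run it as an ES module; the profile values are made up, and I'm assuming the server runs on port 5555 via PORT in .env):

// smoke-test.mjs: a throwaway test script, not part of the project
const res = await fetch("http://localhost:5555/getinfo", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
        name: "Jane Doe",
        url: "https://www.linkedin.com/in/janedoe",
        bio: "Engineer",
        location: "Somewhere",
        followerCount: 1200,
        connectionCount: 340,
    }),
});
console.log(res.status, await res.json());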

Next I had to figure out how to do basic things with Chrome extensions. Since I didn't have much experience with them to begin with, I spent some time learning the basics.

As you may know, extensions can declare content_scripts, which inject JavaScript into the client's browser; these scripts run on every page whose URL matches one of the patterns listed in your manifest.json. So let us break the task of creating this Chrome extension into smaller chunks:

  1. Get a list of LinkedIn profile URLs from a popup.
  2. Open each URL from the list.
  3. Get the user info from each LinkedIn page.
  4. Perform a POST request to our backend server to save it to the database.

The first chunk is actually very easy. Create a manifest.json and add the following to it.

{
  "manifest_version": 3,
  "name": "Linkedin Data Extractor",
  "description": "A chrome extension for a TechKnowHow article.",
  "version": "1.0",
  "action": {
    "default_popup": "index.html",
    "default_icon": "icon.png"
  },
  "permissions": [
    "activeTab",
    "scripting"
  ],
  "content_scripts": [
    {
      "matches": ["https://www.linkedin.com/in/*"],
      "js": ["content.js"],
      "run_at": "document_idle"
    }
  ],
  "background": {
    "service_worker": "background.js"
  }
}

Note the permissions, content_scripts, and background entries. Now create index.html and style.css and add the following to them.

index.html

<!DOCTYPE html>
<html>
<head>
  <title>LinkedIn Data Extractor</title>
  <link rel="stylesheet" href="style.css">
</head>
<body>
  <h1>Extract LinkedIn User Data</h1>
  <textarea id="linkedinUrls" rows="5" placeholder="Paste LinkedIn profile URLs (minimum 3)"></textarea>
  <button id="extractButton">Extract Data</button>
  <script src="script.js"></script>
</body>
</html>

style.css

body{
    width: 300px;
    display: flex;
    flex-direction: column;
    gap: 10px;
    align-items: center;
}
textarea{
    width: 95%;
}
button{
    height: 40px;
    width: 100px;
    background-color: rgb(149, 205, 65);
    color: black;
    border: 0.5px solid black;
}

Now let us write the logic in script.js for getting the LinkedIn URLs and opening them one by one in individual tabs.

document.getElementById('extractButton').addEventListener('click', async () => {
    const urls = document.getElementById('linkedinUrls').value.split('\n').filter(url => url.trim());
    if (urls.length < 3) {
      alert('Please provide at least 3 LinkedIn profile URLs');
      return;
    }
    for (const url of urls) {
        await chrome.tabs.create({ url }); 
    }
});

Okay, so what exactly is happening here? We've added an event listener to the button; when it fires, we read the URLs from the textarea, and if there are fewer than 3 we show an alert saying Please provide at least 3 LinkedIn profile URLs. Otherwise, we open them one by one in new tabs using chrome.tabs.create(). When building Chrome extensions, any task that deals with the browser itself, like reading from a webpage or opening new tabs, goes through the chrome.* API, which enables us to perform all these operations. You can find more on Chrome APIs here.
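As a small optional tweak (not part of my original code), chrome.tabs.create() also accepts an active flag, so the tabs can be opened in the background instead of each new tab stealing focus:

// Open each profile in a background tab so the popup's tab keeps focus.
for (const url of urls) {
    await chrome.tabs.create({ url, active: false });
}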

With that, we are able to open the URLs one by one; now we just need to get the info from the pages. My first instinct was to simply call document.querySelector(selector) and read the details. I tried that for just the name at first to see if it works, and it worked! So I continued with the same approach for all the fields. Here's where I ran into a problem: I was able to get the name, location, and bio, but null was being returned for everything else, and I just couldn't figure out why. If I had gotten null while retrieving all the fields I would have realized what the problem was much earlier, but since the issue affected only half of them, it took me a considerable amount of time to understand where I was going wrong.

My debugging process

Since I was running a content_script, it executes in the client's browser, so I logged everything to the page's console like so.

// content.js

const nameElement = document.querySelector("h1").innerText;
const bioElement = document.querySelector(".text-body-medium").innerText;
const locElement = document.querySelector(".text-body-small.inline.t-black--light.break-words").innerText;
const followersElement = document.querySelector('.artdeco-card .rQrgCqdAxxLhIAcLhxdsifdagjxISOpE span').innerText;
const aboutElement = document.querySelector(".JSzNEGEyfojpwDGomLCFeVXPtVfgJfKE span").innerText;
const connectionElement = document.querySelector(".OnsbwwsPVDGkAkHfUohWiCwsWEWrcqkY").innerText;

console.log(nameElement);
console.log(bioElement);
console.log(locElement);
console.log(followersElement);
console.log(aboutElement);
console.log(connectionElement);

Only name, bio & location were logged, and for everything else I got the following error:
Cannot read properties of null (reading 'innerText'), which basically means that the HTML tag we were trying to locate was not found.

What was the problem?

Whenever we reload, the page takes some time to load. If we try to access these HTML tags before they have even been generated, querySelector returns null. So why were we able to get name, bio & location?

This is probably because the class names for name, bio & location do not change, whereas the class names for the remaining tags look random and are probably generated rather than hand-defined. Those class names may well be different by the time you read this, but the class names for name, bio & location should still be the same.

How do we fix this?

Since the deadline for this task was right around the corner, I came up with a hacky solution: try to get the tag every 100 milliseconds for up to 10 seconds, and if it still isn't there after 10 seconds, simply throw an error. Let's look at the code for that.

function waitForElement(selector, timeout = 10000) {
    return new Promise((resolve, reject) => {
        const interval = 100;
        const endTime = Date.now() + timeout;
        const check = () => {
            const element = document.querySelector(selector);
            if (element) {
                resolve(element);
            } else if (Date.now() < endTime) {
                setTimeout(check, interval);
            } else {
                reject(
                    new Error(
                        `Element with selector "${selector}" not found within ${timeout}ms`
                    )
                );
            }
        };
        check();
    });
}
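As an aside, the same wait can be done without polling: a MutationObserver lets the browser notify us when new nodes are added instead of us checking on a timer. A minimal sketch of that variant, with the same interface so it could be swapped in:

// Same contract as waitForElement, but event-driven instead of polled.
function observeForElement(selector, timeout = 10000) {
    return new Promise((resolve, reject) => {
        const existing = document.querySelector(selector);
        if (existing) return resolve(existing);

        const observer = new MutationObserver(() => {
            const element = document.querySelector(selector);
            if (element) {
                observer.disconnect();
                clearTimeout(timer);
                resolve(element);
            }
        });
        observer.observe(document.body, { childList: true, subtree: true });

        const timer = setTimeout(() => {
            observer.disconnect();
            reject(new Error(`Element with selector "${selector}" not found within ${timeout}ms`));
        }, timeout);
    });
}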

Now, instead of calling document.querySelector() directly, we await waitForElement() in content.js:

async function extractLinkedInData() {
    try {
        const nameElement = await waitForElement("h1");
        const bioElement = await waitForElement(".text-body-medium");
        const locElement = await waitForElement(
            ".text-body-small.inline.t-black--light.break-words"
        );
        const followersElement = await waitForElement(
            '.artdeco-card .rQrgCqdAxxLhIAcLhxdsifdagjxISOpE span'
        );
        const aboutElement = await waitForElement(
            ".JSzNEGEyfojpwDGomLCFeVXPtVfgJfKE span"
        );
        const connectionElement = await waitForElement(
            ".OnsbwwsPVDGkAkHfUohWiCwsWEWrcqkY"
        );

        const followerCount = parseInt(
            followersElement.innerText
                .replace(/,/g, "")
                .replace(" followers", "")
        );

        const user = {
            name: nameElement.innerText,
            bio: bioElement.innerText,
            location: locElement.innerText,
            followerCount: followerCount,
            about: aboutElement.innerText,
            url: document.URL,
        };

        const connectionText = connectionElement.innerText.replace("\n", "");
        if (/\d+\+ connections/.test(connectionText)) {
            // LinkedIn shows "500+ connections" once the count passes 500,
            // so store 501 as a sentinel value.
            user.connectionCount = 501;
        } else if (/\d+ connections/.test(connectionText)) {
            user.connectionCount = parseInt(
                connectionText.replace(" connections", "")
            );
        }
        console.log(user);
        chrome.runtime.sendMessage({ type: "user_data", data: user });
    } catch (error) {
        console.error("Error fetching elements:", error);
    }
}

extractLinkedInData();

Now the only thing left to do is perform the POST request. I tried to do it right after getting all the tags in extractLinkedInData(), only to realise that you cannot make HTTP requests like this from every page: content script requests are subject to the page's own security policies, and LinkedIn does not allow calls to outside origins such as our localhost server. So we have to figure out another way to do this.

Instead, we send a runtime message using the Chrome API, listen for it in background.js, and perform the POST request from there, since the background service worker is not bound by the page's restrictions. (Its console output is visible via the "Inspect views: service worker" link on chrome://extensions.)

// background.js

chrome.runtime.onMessage.addListener(function (message, sender, sendResponse) {
  if (message.type === "user_data") {
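      // This port must match the Express server (assumed to be 5555 via PORT in .env).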
      fetch("http://localhost:5555/getinfo/", {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify(message.data),
      })
          .then((response) => response.json())
          .then((result) => {
              console.log("Success:", result);
          })
          .catch((error) => {
              console.log("Error:", error);
          });
  }
});

And with that we’re done!

The last part, where we performed the POST request from background.js rather than content.js, was similar to the problems I faced in a previous article on Electron.js, where certain scripts do not have access to Node packages. That article covers how to read data from the serial port in a desktop application.
If you know a better way to do something similar let me know in the comments👇🏻
