Recent Project Knowledge Notes

At the beginning of the year I was forced to take on a development task. I had never done product development before, so I had to learn on the job. Between searching on Baidu and asking AI, I have made some progress recently, so I want to jot down the knowledge points I have picked up. They may be fragmented, but they might still be useful.

Project#

I was assigned to develop a website availability monitoring engine. I am not proficient in other languages, so I chose Python, and since Flask is the Python framework I know best, I started there.

Posting data in Python#

Flask and many other Python frameworks have some "unconventional" (or maybe just unconventional to me) ways of writing code. For example, to write a route that accepts POST data, you need to write it like this:

@app.post("/test/api/add")
def ch(request: Request):
      pass

Otherwise, you can only receive data through GET requests.

Concurrency#

At the beginning I was only told it was a module; later it turned out to be an engine, and it kept growing until it was almost a one-person project. At that point I realized the engine would eventually be responsible for monitoring all of the company's projects, a conservative estimate of several thousand sites, which I roughly estimated would mean around 300 concurrent requests. I wondered whether Flask could handle that level of concurrency, so I looked it up and found that Flask's built-in server is a development server and is not recommended for production; I figured this might be because of concurrency limits. I then read that FastAPI is suited to high-concurrency scenarios and is a better fit as an API "engine", so I decided to switch to FastAPI.

  1. Simply switching to FastAPI does not by itself improve performance. You need to declare the functions with the async keyword and actually write asynchronous code to get the benefit.
  2. When receiving the request body, you need to use the await keyword, otherwise you do not get any data. I do not fully understand why yet and still need to study this (see the sketch after this list).
  3. All functions in the call chain need to be async, otherwise there is no performance gain and it may even make things worse.
  4. When using FastAPI, you need to install an ASGI web server separately. The official recommendation is uvicorn, and the install command uses the form uvicorn[xxxx]; this is pip's "extras" syntax, meaning uvicorn is installed together with a set of optional dependencies.
  5. Once the code is asynchronous, the requests library can no longer be used because it does not support asynchronous requests, so I switched to aiohttp.
  6. The most important point: in practice, FastAPI did not give me much better concurrency than Flask.
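
To make points 1, 2 and 5 concrete, here is a minimal sketch of how the earlier route looks after moving to FastAPI, plus a tiny outbound check with aiohttp. The probe helper and its return value are made up for illustration, not the engine's real logic:

import aiohttp
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/test/api/add")
async def ch(request: Request):
    # request.json() returns a coroutine, so it must be awaited;
    # without await you get a coroutine object instead of the parsed body
    data = await request.json()
    return {"received": data}

async def probe(url: str) -> int:
    # aiohttp instead of requests, because requests would block the event loop
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            await resp.read()
            return resp.status

# launched with an ASGI server, e.g.: uvicorn main:app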

Request Monitoring#

The upstream server assigns me tasks, and I need to monitor them.

Monitoring means receiving the URLs pushed to me and then requesting them. Here are a few points to note:

  1. It is better to include a User-Agent (UA) and similar headers in the requests, much like in web scraping, to avoid interference from anti-scraping systems.
  2. When probing websites, it is better to use GET. Some sites restrict the allowed methods; I tried HEAD requests, but they produced false positives, which I suspect comes from how middleware or security devices are configured.
  3. A 200 status code does not necessarily mean the website is accessible. You also need to check the size of the response body to judge availability (see the sketch after this list).
  4. Set a generous timeout: sometimes the target network is slow, and sometimes our own outbound network gets blocked because of the volume of requests.
  5. Strange network architectures exist, government websites for example. Many county-level units host their sites on a higher-level unit's server, so different domain names point to the same machine. From that server's point of view, monitoring all of them looks like high-frequency traffic, similar to a DDoS attack, and it may block the requests. The only real solution is a proxy pool that switches proxies between requests.
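
Putting these points together, a check function could look roughly like the sketch below. The User-Agent value, the body-size threshold, and the timeout are illustrative assumptions rather than the engine's actual configuration:

import asyncio
from typing import Optional

import aiohttp

HEADERS = {"User-Agent": "Mozilla/5.0 (availability monitor)"}   # placeholder UA
MIN_BODY_SIZE = 512                          # bytes; assumed "the page has real content" threshold
TIMEOUT = aiohttp.ClientTimeout(total=60)    # generous timeout for slow targets

async def is_available(url: str, proxy: Optional[str] = None) -> bool:
    # GET rather than HEAD, with a UA header, a long timeout, and an optional
    # proxy so each probe can leave through a different exit
    try:
        async with aiohttp.ClientSession(timeout=TIMEOUT, headers=HEADERS) as session:
            async with session.get(url, proxy=proxy) as resp:
                body = await resp.read()
                status = resp.status
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return False
    # 200 alone is not enough: some broken sites still answer 200 with an
    # almost empty page, so the body size is checked as well
    return status == 200 and len(body) >= MIN_BODY_SIZE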

Overall Architecture#

At first I thought I would just write a script, a few kilobytes at most, and be done with it. But gradually I realized the whole architecture had to change.

First of all, the company did not give me a proxy pool; it only provided a high-frequency dial-up server and the company's exit server. I wanted to make the most of these two machines, so I decided to use load balancing.

Actually, I had only ever heard of load balancing and had never used it. After all, my main job is figuring out how to get in, and a load balancer was just one more set of permissions to take.

Load Balancing#

After studying it properly, I found that load balancing is essentially built on reverse proxying: requests are forwarded through a reverse proxy server. I went back to an article I wrote a long time ago, Single Domain and Single Port Multiple Service Web, and it was already quite close to load balancing.

That article forwarded requests to different servers based on the URL path, while load balancing picks a backend according to a scheduling algorithm. I used to assume this required some specialized professional component, but if it is just reverse proxying, then Nginx is naturally the best choice.

The configuration file looks something like this:

worker_processes  4;  # number of worker processes, usually equal to the number of CPU cores

events {
    worker_connections  40960;  # max simultaneous connections per worker; mine is large because I copied the value from phpstudy
}

http {
    upstream web {  # the load-balancing node list
        server 192.168.1.2:8000;
        server 192.168.1.3:8000;
    }

    server {
        listen 80;  # port the reverse proxy server listens on

        location / {
            proxy_pass http://web;  # "web" refers to the "upstream web" block above; normally this would be an ordinary URL
        }
    }
}

Learning Docker#

Having come this far, installing Nginx by hand on the server would have been quite troublesome, so I thought of Docker. Previously I had only used Docker to run other people's projects, following whatever commands they provided; all I really knew was docker ps and docker images.

This time, I changed my approach and went from being a user to a developer, so I learned some things.

The server system is CentOS 7, so I used yum install docker to install Docker. Then I also needed to install Docker Compose (I finally remembered this word).

Interestingly, on CentOS 7 Docker Compose cannot be installed directly with yum; I had to install it with pip instead. After installing it that way, however, it did not work; there may have been some other issue. So I followed the official documentation and used curl to fetch the executable directly:

sudo curl -L https://github.com/docker/compose/releases/download/1.21.2/docker-compose-$(uname -s)-$(uname -m) -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose

docker-compose.yaml#

I also learned the rules of docker-compose.yaml. SQLite cannot be used in a multi-node architecture, so I needed to deploy MySQL as well. With both components described in Docker Compose, bringing everything up takes a single command (docker-compose up -d).

version: '3'

services:
  nginx:
    image: nginx  # image name
    ports:
      - "19100:80"  # host port : container port
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro  # mount over the config file inside the container (read-only)

  db:
    image: mysql
    environment:
      MYSQL_ROOT_PASSWORD: 123456
      MYSQL_DATABASE: web
    ports:
      - "3306:3306"
    volumes:
      - ./web.sql:/docker-entrypoint-initdb.d/web.sql  # SQL files under docker-entrypoint-initdb.d are imported automatically on first start
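
As a rough sketch, an engine node could talk to this MySQL container like the snippet below, assuming the aiomysql driver and reusing the credentials from the compose file above; the host address and the query are placeholders:

import asyncio
import aiomysql

async def main():
    # host is a placeholder for whichever machine runs the db service;
    # the credentials match the compose file above
    conn = await aiomysql.connect(
        host="192.168.1.10", port=3306,
        user="root", password="123456", db="web",
    )
    try:
        async with conn.cursor() as cur:
            await cur.execute("SELECT 1")   # placeholder query, just a connectivity check
            print(await cur.fetchone())
    finally:
        conn.close()

asyncio.run(main())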

Alright, that's all I can think of for now. I encountered many problems and learned a lot. It's true what they say: the hardships you endure will become your capital in the future.

My browser is starting to lag, so I will stop here. If you have any questions, feel free to comment and push me to learn more. Lastly, there is actually one unsolved issue: the monitoring produces false positives, reporting accessible targets as down, and I cannot figure out why. The same target is reported as unreachable by the Python script but is fine when tested with curl or the httpx tool. I hope someone more experienced can point me in the right direction.
