
Thread: I built a new project

  1. #1
    Join Date
    Feb 2003
    Posts
    3,184
    Rep Power
    0

    Default I built a new project

    Since no one seems to be doing anything new, I dusted off my EditPlus and built a search engine for local gov websites (link). It's basically a crawler that creates a big PHP array, which currently has 3500+ links, titles and descriptions. That array is then stored in a cache file, which the search box reads from.
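    For the curious, the cached index is roughly shaped like this (field names are illustrative guesses, not the actual code):

    PHP Code:
    <?php
    // Rough shape of the crawler's index: one big array of links,
    // titles and descriptions, dumped to a cache file.
    $index = [
        [
            'url'         => 'https://example.gov/health',
            'title'       => 'Ministry of Health',
            'description' => 'Official site of the Ministry of Health...',
        ],
        // ... 3500+ more entries
    ];

    // Cache it as a PHP file the search box can load back cheaply.
    file_put_contents('cache/index.php', '<?php return ' . var_export($index, true) . ';');
    $index = require 'cache/index.php'; // what the search script does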
    The search has about 7 levels. It first checks if the exact term is in the array, then it checks each of the words individually (if it is a multi-word search term). Obviously it can be improved, but it works well for now. I haven't figured out how to associate "andrew" with "prime minister" instead of "st. andrew", but I will probably figure it out eventually.
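    The tiered matching works roughly along these lines (a simplified sketch showing just two of the levels, not the actual code):

    PHP Code:
    <?php
    // Simplified sketch of two of the ~7 levels: an exact match on the
    // whole term first, then each word individually for multi-word terms.
    function search(array $index, string $term): array {
        $results = [];
        $term = strtolower(trim($term));
        $words = preg_split('/\s+/', $term);
        foreach ($index as $page) {
            $haystack = strtolower($page['title'] . ' ' . $page['description']);
            if (strpos($haystack, $term) !== false) {
                $page['rank'] = 2;              // exact term found
                $results[] = $page;
            } elseif (count($words) > 1) {
                $found = 0;
                foreach ($words as $word) {
                    if (strpos($haystack, $word) !== false) $found++;
                }
                if ($found > 0) {
                    $page['rank'] = 9 - $found; // more words found = lower rank
                    $results[] = $page;
                }
            }
        }
        return $results;
    }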
    Last edited by owen; Jan 17, 2018 at 04:53 PM.

  2. #2
    Join Date
    Apr 2013
    Posts
    43
    Rep Power
    0

    Default

    Wow!!! This is pretty awesome bro, I like the concept ... it still lacks a few things and needs tweaks here and there, but over time it will get better ... this has great potential, keep it up ...
    HP 450 G4 - i5 (7th Gen) 3.1Ghz | 12GB DDR4 2133Mhz | 500GB HD + 256GB M.2 NVMe | Intel HD 620 | Asus RT-N66U (Merlin) Custom Firmware | 6TB HDD (Total) | JBL Pro Desktop Speakers | Skullcandy Indy Earbuds...

  3. #3
    Join Date
    Feb 2003
    Posts
    3,184
    Rep Power
    0

    Default

    You can even compare the results to what you would get on Google, but clearly Google is searching the entire internet and I am basically just going through a handful of sites. About 18 or so websites, totaling about 150 MB of plain HTML on my hard disk. Of course I accidentally downloaded some PDFs, but I figured it out quickly enough to stop that mess.

    @EthicalHacker I limit how deep I go, so I might have missed a few government websites and some of the older/spammy ones that either never get updated or update too often.
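    The depth limit (plus the content-type check that stopped the PDF mess) could look something like this hypothetical sketch (extractLinks() is a stand-in, not a real helper):

    PHP Code:
    <?php
    // Hypothetical depth-limited fetch: don't follow links past $maxDepth,
    // and skip anything that isn't plain HTML (filters out stray PDFs).
    function crawl(string $url, int $depth, int $maxDepth, array &$seen): void {
        if ($depth > $maxDepth || isset($seen[$url])) return;
        $seen[$url] = true;
        $headers = get_headers($url, true);
        $type = $headers['Content-Type'] ?? '';
        if (is_array($type)) $type = end($type);          // redirects return arrays
        if (strpos($type, 'text/html') === false) return; // skip PDFs etc.
        $html = file_get_contents($url);
        // ... pull the title/description into the index here ...
        foreach (extractLinks($html, $url) as $link) {    // extractLinks() is a stand-in
            crawl($link, $depth + 1, $maxDepth, $seen);
        }
    }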
    Last edited by owen; Jan 17, 2018 at 12:29 PM.

  4. #4
    Join Date
    Jan 2011
    Posts
    962
    Rep Power
    14

    Default

    No link to a GitHub repo or a website URL?
    1.8 Ghz Pentium 4 (OC'd.) / Intel P4 (478) Motherboard / 800MHz DDR / 256 Mb DDR RAM / 40GB Seagate / RIVA TNT2 Pro 32MB / 24X12X24 Sony CDRW+ / 18" View Sonic CRT / Windows ME Yes it will play Doom... i plan on trying Crysis 3 one of these days.

  5. #5
    Join Date
    Feb 2003
    Posts
    3,184
    Rep Power
    0

    Default

    I linked it above but here it is again (link). It's a search box for gov stuff.

  6. #6
    Join Date
    Feb 2003
    Posts
    3,184
    Rep Power
    0

    Default

    I added a spell check with soundex and levenshtein using a collection of 23000 words collected by the crawler. It works pretty well except when searching for words that are already misspelt in the word collection. I could go through and fix the collection, but it's a waste of time since the changes will just get overwritten the next time the crawler runs. It is good enough as is.
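    Both of those are built into PHP, so the suggestion logic stays short; roughly (a sketch, not the actual code):

    PHP Code:
    <?php
    // Suggest the closest word from the crawler's word collection:
    // prefilter by matching soundex code, then pick the candidate with
    // the smallest levenshtein distance. Both functions are PHP built-ins.
    function suggest(string $word, array $collection): ?string {
        $best = null;
        $bestDist = PHP_INT_MAX;
        $code = soundex($word);
        foreach ($collection as $candidate) {
            if (soundex($candidate) !== $code) continue; // cheap prefilter
            $dist = levenshtein($word, $candidate);
            if ($dist < $bestDist) {
                $bestDist = $dist;
                $best = $candidate;
            }
        }
        return $best;
    }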

  7. #7
    Join Date
    Aug 2002
    Posts
    6,223
    Rep Power
    0

    Default

    Nice. I did a search for PICA and got the results I wanted. What are the # numbers to the left of the URL though?
    PC - Ubuntu 15.04 64bit Desktop
    HP Pav G60-236US 3GB RAM Laptop, Ubuntu 15.04 64bit and Win7 Home

    "So Daddy, how come you telling me stealing not right when YOU copying DVDs? How come? How Come?"


    RIP Ramesh ...

  8. #8
    Join Date
    Feb 2003
    Posts
    3,184
    Rep Power
    0

    Default

    I rank each result based on the type of search that found it. Exact matches in the title or URL rank at 2; matches found in the URL rank at something like 7. The more exact the search, or the more words found, the lower the rank and the higher the result sits in the list. Then I use a custom array sort function to order the resulting array by that column. I don't really need to show it, but it helps with debugging.
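    The sort itself is just a one-liner with usort(); something like this sketch:

    PHP Code:
    <?php
    // Order results by the rank column: lower rank = better match,
    // so it ends up higher in the list.
    usort($results, function (array $a, array $b): int {
        return $a['rank'] <=> $b['rank'];
    });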

  9. #9
    Join Date
    Feb 2003
    Posts
    3,184
    Rep Power
    0

    Default

    I have been updating my index on this over the weekend. Wow, there are some terrible HTML5 websites out there. I had to delete 2 gigs of crap charts from one site (which I will not mention) that thought it was a good idea to render all its data on the front end.
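    A simple guard against that sort of thing would be capping how much gets downloaded per page; a hypothetical sketch ($url being whatever the crawler is fetching):

    PHP Code:
    <?php
    // Hypothetical cap: read at most 1 MB per page so one bloated site
    // can't dump gigabytes of front-end chart data into the index.
    $maxBytes = 1024 * 1024;
    $stream = @fopen($url, 'r');
    $html = $stream ? stream_get_contents($stream, $maxBytes) : '';
    if ($stream) fclose($stream);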
